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BLOCK ALGORITHMS WITH AUGMENTED RAYLEIGH-RITZ PROJECTIONS 
FOR LARGE-SCALE EIGENPAIR COMPUTATION 

ZAIWEN WEN* AND YIN ZHANG § 


Abstract. 

Most iterative algorithms for eigenpair computation consist of two main steps: a subspace update (SU) step 
that generates bases for approximate eigenspaces, followed by a Rayleigh-Ritz (RR) projection step that extracts 
approximate eigenpairs. So far the predominant methodology for the SU step is based on Krylov subspaces that 
builds orthonormal bases piece by piece in a sequential manner. In this work, we investigate block methods in the 
SU step that allow a higher level of concurrency than what is reachable by Krylov subspace methods. To achieve a 
competitive speed, we propose an augmented Rayleigh-Ritz (ARR) procedure and analyze its rate of convergence 
under realistic conditions. Combining this ARR procedure with a set of polynomial accelerators, as well as utilizing 
a few other techniques such as continuation and deflation, we construct a block algorithm designed to reduce the 
number of RR steps and elevate concurrency in the SU steps. Extensive computational experiments are conducted 
in Matlab on a representative set of test problems to evaluate the performance of two variants of our algorithm in 
comparison to two well-established, high-quality eigensolvers ARPACK and FEAST. Numerical results, obtained on 
a many-core computer without explicit code parallelization, show that when computing a relatively large number 
of eigenpairs, the performance of our algorithms is competitive with, and frequently superior to, that of the two 
state-of-the-art eigensolvers. 


1. Introduction. For a given real symmetric matrix A £ R" x ", let Ai, A 2 , • • • , A n be 
the eigenvalues of A sorted in an descending order: Ai > A 2 > • • • > X n , and qi,..., q n £ 
R" be the corresponding eigenvectors such that Aqi = Xiqi, ||(?i ||2 = 1, i = 1, ,71 and 
qf qj = 0 for i ^ j. The eigenvalue decomposition of A is defined as A = Q n A„Q^, where, 
for any integer i £ [1, n], 

(1.1) Qi = [qi, 92 , eK" xi , A i = diag(A 1 ,A 2) ...,A i ) eR ,xi , 

where diag(-) denotes a diagonal matrix with its arguments on the diagonal. For simplicity, 
we also write A = QAQ T where Q = Q n and A = A n . In this paper, we consider A to be 
large-scale, which usually implies that A is sparse. Since eigenvectors are generally dense, 
in practical applications, instead of computing all n eigenpairs of A, it is only realistic to 
compute k -C n eigenpairs corresponding to k largest or smallest eigenvalues of A. Fortu¬ 
nately, these so-called exterior (or extreme) eigenpairs of A often contain the most relevant 
or valuable information about the underlying system or dataset represented by the matrix A. 
As the problem size n becomes ever larger, the scalability of algorithms with respect to k has 
become a critical issue even though k remains a small portion of n. 

Most algorithms for computing a subset of eigenpairs of large matrices are iterative in 
which each iteration consists of two main steps: a subspace update step and a projection step. 
The subspace update step varies from method to method but with a common goal in finding 
a matrix X £ MA yk so that its column space is a good approximation to the /,•-dimensional 
eigenspace spanned by k desired eigenvectors. Once X is obtained and orthonormalized, the 
projection step, often referred to as the Rayleigh-Ritz (RR) procedure, aims to extract from 
X a set of approximate eigenpairs (see more details in Section [2} that are optimal in a sense. 
More complete treatments of iterative algorithms for computing subsets of eigenpairs can be 
found, for example, in the books fTl fl~6lI2T1 l3i l26l . 
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At present, the predominant methodology for subspace updating is still Krylov subspace 
methods, as represented by Lanczos type methods l9l lT2ll for real symmetric matrices. These 
methods generate an orthonormal matrix X one (or a few) column at a time in a sequential 
mode. Along the way, each column is multiplies by the matrix A and made orthogonal to all 
the previous columns. In contrast to Krylov subspace methods, block methods, as represented 
by the classic simultaneous subspace iteration method m, carry out the multiplications of 
A to all columns of X at the same time in a batch mode. As such, block methods generally 
demand a lower level of communication intensity. 

The operation of the sparse matrix A multiplying a vector, or SpMV, used to be the most 
relevant complexity measure for algorithm efficiency. As Krylov subspace methods generally 
tend to require considerably fewer SpMVs than block methods do, they had naturally become 
the methodology of choice for the past a few decades even up to date. However, the evolution 
of modern computer architectures, particularly the emergence of multi/many-core architec¬ 
tures, has seriously eroded the relevance of SpMV (and arithmetic operations in general) as 
a leading complexity measure, as communication costs have, gradually but surely, become 
more and more predominant. 

The purpose of this work is to construct, analyze and test a framework for block al¬ 
gorithms that can efficiently, reliably and accurately compute a relatively large number of 
exterior eigenpairs of large-scale matrices. The algorithm framework is constructed to take 
advantages of multi/many-core or parallel computers, although a study of parallel scalability 
itself will be left as a future topic. It appears widely accepted that a key property hindering 
the competitiveness of block methods is that their convergence can become intolerably slow 
when decay rates in relevant eigenvalues are excessively flat. A central task of our algorithm 
construction is to rectify this issue of slow convergence. 

Our framework starts with an outer iteration loop that features an enhanced RR step 
called the augmented Rayleigh-Ritz (ARR) projection which can provably accelerate con¬ 
vergence under mild conditions. For the SU step, we consider two block iteration schemes 
whose computational cost is dominated by block SpMVs: (i) the classic power method ap¬ 
plied to multiple vectors without periodic orthogonalization, and (ii) a recently proposed 
Gauss-Newton method. For further acceleration, we apply our block SU schemes to a set of 
polynomial accelerators, say p(A), aiming to suppress the magnitudes of p(Xj) where Xj’s 
are the unwanted eigenvalue of A for j > k. In addition, a deflation scheme is utilized to 
enhance the algorithm’s efficiency. Some of these techniques have been studied in the litera¬ 
ture over the years (e.g. l20l[29l on polynomial filters), and are relatively well understood. In 
practice, however, it is still a nontrivial task to integrate all the aforementioned components 
into an efficient and robust eigensolver. For example, an effective use of a set of polynomial 
filters involves the choice of polynomial types and degrees, and the estimations of intervals 
in which eigenvalues are to be promoted or suppressed. There are quite a number of choices 
to be made and parameters to be chosen that can significantly impact algorithm performance. 

Specifically, our main contributions are summarized as follows. 

1. An augmented Rayleigh-Ritz (ARR) procedure is proposed and analyzed that prov¬ 
ably speeds up convergence without increasing the block size of the iterate matrix 
X in the SU step (thus without increasing the cost of SU steps). This ARR proce¬ 
dure can significantly reduce the number of RR projections needed, at the cost of 
increasing the size of a few RR calls. 

2. A versatile and efficient algorithmic framework is constructed that can accommodate 
different block methods for subspace updating. In particular, we revitalize the power 
method as an exceptionally competitive choice for a high level of concurrency. Be¬ 
sides ARR, our framework features several important components, including 
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• a set of low-degree, non-Chebyshev polynomial accelerators that seem less 
sensitive to erroneous intervals than the classic Chebyshev polynomials; 

• a bold stoping rule for SU steps that demands no periodic orthogonalizations 
and welcomes a (near) loss of numerical rank. 

With regard to the issue of basis orthogonalization, we recall that in traditional block 
methods such as the classic subspace iteration, orthogonalization is performed either at every 
iteration or frequently enough to prevent the iterate matrix X from losing rank. On the con¬ 
trary, our algorithms aim to make X numerically rank-deficient right before performing an 
RR projection. 

The rest of this paper is organized as follows. An overview of relevant iterative algo¬ 
rithms for eigenpair computation is presented in Section [2] The ARR procedure and our 
algorithm framework are proposed in Section [3] We analyze the ARR procedure in Section 
Q] The polynomial accelerators used by us are given in Section 0 A detailed pseudocode for 
our algorithm is outlined in Section^ Numerical results are presented in Section[7] Finally, 
we conclude the paper in Section[8] 

2. Overview of Iterative Algorithms for Eigenpair Computation. Algorithms for 
eigenvalue problem have been extensively studied for decades. We will only briefly review a 
small subset of them that are most closely related to the present work. 

Without loss of generality, we assume for convenience that A is positive definite (after 
a shift if necessary). Our task is to compute k largest eigenpairs (Qk, A k) for some k <C n 
where by definition AQk = Qk^-k and (/[.Qk = I £ M. k ' xk . Replacing A by a suitable 
function of A, say Ai I — A, one can also in principle apply the same algorithms to finding k 
smallest eigenpairs as well. 

An RR step is to extract approximate eigenpairs, called Ritz-pairs, from a given matrix 
Z £ JR" x m whose range space, 7Z(Z), is supposedly an approximation to a desired Tri¬ 
dimensional eigenspace of A. Let orth (Z) be the set of orthonormal bases for the range 
space of Z. The RR procedure is described as Algorithm|T]below, which is also denoted by a 
map (Y, E) = RR(A, Z) where the output (Y , E) is a Ritz pair block. 


Algorithm 1: Rayleigh-Ritz procedure: ( Y. E) = RR ( A. Z) 

1 Given Z £ 1RT xm , orthonormalize Z to obtain U £ orth (Z). 

2 Compute H = U T AU £ R mxm , the projection of A onto orth(.Z'). 

3 Compute the eigen-decomposition H = V r YV, where V T V = I and E is diagonal. 

4 Assemble the Ritz pairs (Y, E) where Y = UV £ R raxm satisfies Y T Y = I. 


It is known (see for example) that Ritz pairs are, in a certain sense, optimal approx¬ 
imations to eigenpairs in 1Z(Z), the column space of Z. 

2.1. Krylov Subspace Methods. Krylov subspaces are the foundation of several state- 
of-the-art solvers for large-scale eigenvalue calculations. By definition, for given matrix A £ 
K" x ” and vector v £ K ra , the Krylov subspace of order k is span{u, Av, A 2 v ,..., A k_1 v}. 
Typical Krylov subspace methods include Arnoldi algorithm for general matrices (e.g., mu 
QU) and Lanczos algorithm for symmetric (or Hermitian) matrices (e.g., 12311X01 ). In either 
algorithm, orthonormal bases for Krylov subspaces are generated through a Gram-Schmidt 
type process. Jacobi-Davidson methods (e.g., [(21 1241 ) are based on a different framework, but 
they too rely on Krylov subspace methodologies to solve linear systems at every iteration. 

As is mentioned in the introduction, Krylov-subspace type methods are generally most 
efficient in terms of the number of SpMVs (sparse matrix-dense vector multiplications). In- 
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deed, they remain the method of choice for computing a small number eigenpairs. However, 
due to the sequential process of generating orthonormal bases, Krylov-subspace type methods 
incur a low degree of concurrency, especially as the dimension k becomes relatively large. 
To improve concurrency, multiple-vector versions of these algorithms have been developed 
where each single vector in matrix-vector multiplication is replaced by a small number of 
multiple vectors. Nevertheless, such a remedy can only provide a limited relief in the face 
of the inherent scalability barrier as k grows. Another well-known limitation of Krylov sub¬ 
space methods is the difficulty to warm-start them from a given subspace. Warm-starting is 
important in an iterative setting in order to take advantages of available information computed 
at previous iterations. 

2.2. Classic Subspace Iteration. The simple (or simultaneous) subspace iteration (SSI) 
method (see El HU [25] ED, for example) extends the idea of the power method which com¬ 
putes a single eigenpair corresponding to the largest eigenvalue (in magnitude). Starting from 
an initial (random) matrix U, SSI performs repeated matrix multiplications AU, followed by 
periodic orthogonalizations and RR projections. The main purpose of orthogonalization is 
to prevent the iterate matrix U from losing rank numerically. In addition, since the rates of 
convergence for different eigenpairs are uneven, numerically converged eigenvectors can be 
deflated after each RR projection. A version of SSI algorithm is presented as Algorithm [2] 
below, following the description in l26l . 


Algorithm 2: Subspace Iteration 

1 Initialize orthonormal matrix U G K" xm with m = k + q > k. 

2 while the number of converged eigenpairs is less than k, do 

3 

while convergence is not expected, do 

4 


while the columns ofU are sufficiently independent, do 

5 


^ Compute U = AU 

6 


Orthogonalize the columns of U. 

7 

Perform an RR step using U. 

8 

Check convergence and deflate. 


In the above SSI framework, q extra vectors, often called guard vectors, are added into 
iterations to help improve convergence at the price of increasing the iteration cost. 

A main advantage of SSI is the use of simultaneous matrix-block multiplications instead 
of individual matrix-vector multiplications. It enables fast memory access and highly par- 
allelizable computation on modern computer architectures. Furthermore, SSI method has a 
guaranteed convergence to the largest k eigenpairs from any generic starting point as long as 
there is a gap between the fc-th and the (k + l)-th eigenvalues of A. As is points out in l26l . 
“combined with shift-and-invert enhancement or Chebyshev acceleration, it sometimes wins 
the race”. However, a severe shortcoming of the SSI method is that its convergence speed de¬ 
pends critically on eigenvalue distributions that can, and often does, become intolerably slow 
in the face of unfavorable eigenvalue distributions. Thus far, this drawback has essentially 
prevented the SSI method from being used as a computational engine to build robust, reliable 
and efficient general-purpose eigensolvers. 

2.3. Trace Maximization Methods. Computing a /."-dimensional eigenspace associated 
with k largest eigenvalues of A is equivalent to solving an orthogonality constrained trace 
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maximization problem: 

(2.1) max tr (X T AX), s.t ,X T X = I. 

This formulation can be easily extended to solving the generalized eigenvalue problem where 
X T X = I is replace by X T BX = I for a symmetric positive definite matrix B £ K nxn . 
When maximization is changed to minimization, one computes an eigenspace associated with 
k smallest eigenvalues. The algorithm TraceMin Il22l solves the trace minimization problem 
using a Newton type method. 

Some block algorithms have been developed based on solving (12.11) . include the locally 
optimal block preconditioned conjugate gradient method (LOBPCG) 0 and more recently 
the limited memory block Krylov subspace optimization method (LMSVD) | fl3l . At each 
iteration, these methods solve a subspace trace maximization problem of the form 

(2.2) Y = argmax {tr(X T AX) : X T X = 1, X £ 5} , 

jxeR”x fe 

where X £ S means that each column of X is in the given subspace S which varies from 
method to method. LOBPCG constructs S as the span of the two most recent iterates A 1 '> 
and X < - 1 \ and the residual at XW, which is essentially equivalent to 

(2.3) <S = span {x^, X®, AX (i) } , 

where the term AX il ' > may be pre-multiplied by a pre-conditioning matrix. In the LMSVD 
method, on the other hand, the subspace S is spanned by the current *-th iterate and the 
previous p iterates; i.e., 

(2.4) 5 = span {x X**" 1 *, ..., X^} , 

In general, the subspace S should be constructed such that the cost of solving (12.2b can be 
kept relatively low. The parallel scalability of these algorithms, although improved from that 
of Krylov subspace methods, is now limited by the frequent use of basis orthogonalizations 
and RR projections involving to x to matrices where m is the dimension of the subspace S 
(for example, m = 3k in LOBPCG). 

2.4. Polynomial Acceleration. Polynomial filtering has been used in eigenvalue com¬ 
putation in various ways (see, for example, mmmm. For a polynomial function 
p(t) : R —> R and a symmetric matrix with eigenvalue decomposition A = QAQ T , it 
holds that 

n 

(2.5) p(A) = Qp{A)Q T = ^2 p(Xi)qiqf, 

i= 1 

where p{A) = diag(/?(Ai), p{ A 2 ),..., p( A n )). By choosing a suitable polynomial function 
p(t) and replacing A by p(A), we can change the original eigenvalue distribution into a more 
favorable one at a cost. To illustrate the idea of polynomial filtering, suppose that p(t) is a 
good approximation to the step function that is one on the interval [A&, Ai] and zero other¬ 
wise. For a generic initial matrix X £ R nxfe , it follows from (12.51 ) that p(A)X « QkQ^X, 
which would be an approximate basis for the desired eigenspace. In practice, however, ap¬ 
proximating a non-smooth step function by polynomials is an intricate and demanding task 
which does not always lead to efficient algorithms. 
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For the purpose of convergence acceleration, the most often used polynomials are the 
Chebyshev polynomials (of the first kind), defined by the three-term recursion: 

(2.6) p d +i{t) = 2tp d (t) - p d -i(t), d> 1, 


where po (t) = 1 and pi (t) = t. Some recent works that use Chebyshev polynomials include 
ll29l l6l. for example. 

2.5. FEAST. The FEAST algorithm [[17] [28) is based on complex contour integrals for 
computing all eigenvalues in a given interval [a, b\ C R and their corresponding eigenvectors. 
It is equivalent to using a rational function filter in subspace iteration. 

Let C be the circle on the complex plane centered at c = with radius r = which 
can be parameterized by the function <p(t) — c + re L %( 1+t ) for t £ [—1, 3] where t 2 3 4 5 6 = —1 is 
the imaginary unit. By the Cauchy integral theorem, for any p ^ C 


j_ i j - d2 =j_ /■ _m_ 

2ttl J c z — p 27 tl J_i <fi(t) — p 


~ P 


dt 


1, if \p — c\ < r 

0, if \p — c\ > r ’ 


where the integral on [1, 3] has been equivalently transformed into [—1,1]. Applying a q- 
point Gauss-Legendre quadrature formula with weight-node pairs (w/ , t/), l = 1.2.... ,q, 
such that wi > 0 and ti € (—1,1), the above integral can be approximated by the rational 
function 


where (pi = (p(ti) and <j; = wi<p'(ti)/(2m). Since none of (pi’s is real and A is symmetric, 
the matrices (pil — A and <pil — A are all invertible for l = 1,2,... ,q. Therefore, 

(2.7) p(A) = ^2ai{(pil - A) -1 - A)^ 1 

i=i i=i 

is a rational function filter approximating a desired step function on the real line. The appli¬ 
cation of this filter to X £ R r ' ,xm , i.e., computing p(A)X, will require solving q (since all 
quantities involved are real) linear systems of equations with m right-hand sides each. It is 
notable that these linear systems could be solved independently in parallel. 

In order to compute all eigenpairs in an interval [a, b ], FEAST need to estimate the number 
of eigenvalues in the interval [a, b\. It repeatedly applies the rational filter X = p(A)X, 
followed by an RR projection. A high-level summary of the FEAST algorithm is presented as 
Algorithm^ 


07 _ (Ti \ 

-p (pl-p) 


Algorithm 3: A abstract version of FEAST 

1 Input [a, b] and m - estimated number of eigenvalues in [a, b\. 

2 Choose a Gauss-Legendre quadrature formula with q nodes. 

3 Initialize a matrix X £ M nxm . 

4 while not “converged”, do 

5 Compute X = p{A)X with p(-) given in (12.7k 

6 Do RR projection using X to extract Ritz pairs. 
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It should be clear that the performance of FEAST depends strongly on the efficiency of 
solving the linear systems of equations involved in applying the rational filter p(A) to X. 
In addition, in order to compute the k largest eigenpairs, for example, one need to supply 
FEAST with an interval [a,b] D [A*, Ai]. The quality of this interval [a, 6] could have a 
significant effect on the performance of FEAST. 

2.6. A Gauss-Newton Algorithm. A Gauss-Newton (GN) algorithm is recently pro¬ 
posed in m to compute the eigenspace associated with k largest eigenvalues of A based on 
solving the nonlinear least squares problem: min || XX T — A|||,where X £ R” xfe , || • ||| is 
the Frobenius norm squared and A is assumed to have at least k positive eigenvalues. If the 
eigenpairs of A are required, then an RR projection must be performed afterwards. 

It is shown in fl4l that at any full-rank iterate X £ R” xfe , the GN method takes the 
simple closed form 



where the parameter a > 0 is a step size. Notably, this method requires to solve a small k x k 
linear system at each iteration. It is also shown in D3 that the fixed step a = 1 is justifiable 
from either a theoretical or an empirical viewpoint, which leads to a parameter-free algorithm 
given as Algorithm |4] named simply as GN. For more theoretical and numerical results on 
this GN algorithm, we refer readers to Di- 


Algorithm 4: A GN Algorithm: X = Gn(A, X) 

1 Initialize X sR nxt toa rank-fc matrix. 

2 while “the termination criterion ” is not met, do 

3 Compute Y = X (X T X) 1 and Z = AY. 

4 _ Compute X = Z - X(Y T Z - I)/ 2. 

5 Perform an RR step using X if Ritz-pairs are needed. 


3. Augmented Rayleigh-Ritz Projection and Our Algorithm Framework. We first 
introduce the augmented Rayleigh-Ritz or ARR procedure. It is easy to see that the RR map 
(Y, E) = RR(A, Z) is equivalent to solving the trace-maximization subproblem (12.21) with 
the subspace S = 1Z(Z), while requiring Y T AY to be a diagonal matrix E. For a fixed 
number k, the larger the subspace 7 Z(Z) is, the greater chance there is to extract better Ritz 
pairs. The classic SSI always sets Z to the current iterate X^\ while both LOBPCG Q 
and LMSVD lH3l augment X M by additional blocks (see (12.31 ) and (12.4b . respectively). Not 
surprisingly, such augmentations are the main reason why algorithms like LOGPCG and 
LMSVD generally achieve faster convergence than that of the classic SSI. 

In this work, we define our augmentation based on a block Krylov subspace structure. 
That is, for some integer p > 0 we define 


5 = span{X, AX, A 2 X,A P X}. 


(3.1) 


This choice (13.11 ) of augmentation is made mainly because it enables us to conveniently ana¬ 
lyze the acceleration rates induced by such an augmentation (see the next Section). It is more 
than likely that some other choices of S may be equally effective as well. 

The optimal solution of the trace maximization problem (12.21 ). restricted in the subspace 
S in ED. can be computed via the RR procedure, i.e.. Algorithm [T| We formalize our 
augmented RR procedure as Algorithm^ which will often be referred to simply as ARR. 
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Algorithm 5: ARR: (Y, E) = Arr(A, X,p) 

1 Input X e R" xfc and p > 0 so that (p + l)k < n. 

2 Construct augmentation X p = [X AX A 2 X ••• A P X], 

3 Perform an RR step using (Y, E) = RR(A, X p ). 

4 Extract k leading Ritz pairs (Y, E) from (Y, E). 


We next introduce an abstract version of our algorithmic framework with ARR projec¬ 
tions. It will be named ARRABIT (standing for ARR and block iteration). A set of polynomial 
functions {pd{t)}, where d is the polynomial degree, and an integer p > 0 are chosen at the 
beginning of the algorithm. At each outer iteration, we perform the two main steps: subspace 
update (SU) step and augmented RR (ARR) step. There are two sets of stopping criteria: 
inner criteria for the SU step, and outer criteria for detecting the convergence of the whole 
process. 

In principle, the SU step can be fulfilled by any reasonable updating scheme and it does 
not require orthogonalizations. In this paper, we consider the classic power iteration as our 
main updating scheme, i.e., for X = [x\ X 2 • • • , x m \ € R nxm , we do 

oc ■ 

Xi = p(A)xi and Xi = .. ' , j = 1, 2, • • • , m. 

IMI2 

Since the power iteration is applied individually to all columns of the iterate matrix X, we 
call this scheme multi-power method or MPM. Here we intensionally avoid to use the term 
subspace iteration because, unlike in the classic SSI, we do not perform any orthogonalization 
during the entire inner iteration process. 

To examine the versatility of the ARRABIT framework, we also use the Gauss-Newton 
(GN) method, presented in AlgorithmQ] as a second updating scheme. Since the GN variant 
requires solving k x k linear systems, its scalability with respect to k may be somewhat lower 
than that of the MPM variant. Together, we present our ARRABIT algorithmic framework in 
Algorithm^ The two variants, corresponding to “inner solvers” MPM and GN, will be named 
ARRABIT-MPM and ARRABIT-GN, or simply MPM and GN. 


Algorithm 6: Algorithm ARRABIT (abstract version) 

1 Input A e l" xn , k, p and p[t). Initialize X e M. nxk . 

2 while not “converged”, do 

3 while “inner criteria” are not met, do 

4 if MPM is the inner solver, then 

5 L X = p{A)X , then normalize columns individually. 

6 if GN is the inner solver, then 

7 L X = as * s gi ven by AlgorithmQ] 

8 ARR projection: ( X , E) = A RR( A. X. p), as in AlgorithmQ] 

9 Possibly adjust p, the degree of p(t), and perform deflation. 


It is worth mentioning that the “inner criteria” in the ARRABIT framework can have a 
significant impact on the efficiency of Algorithm [6] Against the conventional wisdom, we 
do not attempt to keep X numerically full rank by periodic orthogonalizations which can be 
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quite costly. Instead, we keep iterating until we detect that X is about to lose, or has just lost, 
numerical rank. More details on this issue will be given in Algorithm[8]in Section[6] 

4. Analysis of the Augmented Rayleigh-Ritz Procedure. 

4.1. Notation. Recall that the eigen-decomposition of A £ R” xra is A = QAQ T . In 
anticipation of later usage, for integer h £ [1, n) we introduce the partition Q = [Qh Qh+] 
where, as previously defined, Qh = [qi </2 • • • qh] and 


(4.1) Qh+ = [qh+i qh+i ■■■ Ini- 

Let X £ R" xfc be an approximate basis for lZ(Qk), the range space of Qk or the 
eigenspace spanned by the first k eigenvectors of A. It is desirable for X to have a large 
projection QkQ^X = i Qg! X onto lZ(Qk) relative to that onto lZ(Qk+)- Therefore, a 
good measure for the relative accuracy of X is the following ratio 


(4.2) 


Sk(X) 4 


max,;>fc HgfX'H 
min^fe \\qfX\\ 


where ||(jTX|| = ||(gi<j , ^ r )X r || measures the size of the projection of X onto the span of the 
*-th eigenvector q,. Clearly, the smaller 6k(X) is, the better is X as an approximate basis for 

n{Qk). 

Let Y £ R" xfc be another approximate basis for the eigenspace 7 Z(Qk) which is con¬ 
structed from X. To compare Y with X, we naturally compare Sk{Y) with 5k(X). More 
precisely, we will try to estimate the ratio 6k(Y)/dk{X) and show that under reasonable 
conditions, it can be made much less than the unity. 

To facilitate presentation, we introduce the following Vandermonte matrix constructed 
from the spectrum of A\ 


( 1 

Ai 

A? ■ 

• A? 

\ 


1 

A2 

Ai • 

• A£ 








g fl£7lX(p+l) 

V 1 

An 

\ 2 . 

■ A p 

'n 

) 



where Ai, • ■ • , A„ are the eigenvalues of A. 

4.2. Technical Results. Before calling the ARR procedure, we have an iterate matrix 
X £ R nxfe . From X, we construct the augmented matrix [X AX ■ ■ ■ A P X\ £ R rax (p+ 1 ) fc 
which we call X p for a given p > 0. In view of the eigen-decomposition A = QAQ T , we 
have the expression X p = QG where 


(4.4) G=[Q t X AQ t X A p Q t X]. 

We next normalize the rows of G. Let D be the diagonal matrix whose diagonal consists of 
the row norms of G. From the structure of G in (14.4k it is easy to see that 

(4-5) D ii = \\efd\\ = \\qfX\\\\efVl 

where e, is the z-th column of the n x n identity matrix and V is defined in (14.3k Let D t be 
the pseudo-inverse of D, that is, I) ' is a diagonal matrix with 


(4.6) 



if Da ^ 0, 
otherwise. 
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The normalization of the rows of G in ( 14.4b defines another matrix 

(4.7) G = D^G = [C AC A P C], 

where C = D'Q T X and the nonzero rows of G all have unit norm. Now we rewrite 

(4.8) X P = QDD ] G = QDG. 

Let to be a parameter varying in the following range: for p > 0 such that k + pk < n, 

(4.9) m € [k,k +pk). 


We perform the partition 
(4.10) X p = [Q m Q m+ ] 


' D t O' 


r Gi 1 

o d 2 


-1 

CM 

_ 1 


— [Qm Qm+] 


D rGi 

D2G2 


where D and G are partitioned following that of Q. In particular, G\ consists of the first m 
rows of G and G 2 the last n — m rows of G. 

In the sequel, we will make use of an important assumption on G\ £ R m x(p+ 1 ) fc which 
we formally name as the Gi-Assumption: 


(4.11) Gi-Assumption: the first to rows of G (or G) are linearly independent. 

The Gi-Assumption implies that (i) 1)\ > 0, and (ii) the pseudo-inverse G\ exists such that 
GlG 1 = Imxm - Let 


(4.12) Y p = X p G\D^ = [QmQm+\ D 2 q 2 q\ iD -i ■ 

In particular, we are interested in the first k columns of Y p , i.e., by Matlab notation, 

(4.13) Y = Yp(:,l:fc) £ R nxk . 


We summarize what we already have for Y into the following lemma. 

Lemma 4.1. Let A = QAQ T be the eigen-decomposition of A = A T £ R nxn . For 
integers k > 0 and p > 0 satisfying (p + 1 )k < n, and to £ [k,k + pk], let G, X p , Y p and 
Y be defined as in ( 14.71 ), ( 14.81 ), ( 14.12b and ( 14.131 ). respectively. Under the Gi-Assumption, 


(4.14) Y = Q m E k + Q m+ SE k , 

where S = D 2 G 2 G\Df 1 and E k £ R mxfe consists of the first k columns of the m x m 
identity matrix. 

Proof. The equality directly follows from (14.12b and (14.13b . □ 

Since Y is extracted from the subspace 1Z(X p ) constructed from X, a central question 
is how much improvement Y can provide over X as an approximate basis for lZ(Q k ). We 
study this question by comparing the accuracy measure 5 k (Y ) relative to 6 k (X). First, we 
estimate S k (Y). 

LEMMA 4.2. Under the conditions of Lemma \4. 1\ 


(4.15) 


6 k (Y) < max<>m f i max \\efG 2 G\E k \\. 
minjOj di l<i<n-m 


where d = diag(Z7) with Du defined in (14.5b . 
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Proof. It follows from ( 14. 14b that 

f ej, is[l,k] 
qfY = < 0 T , i E (k,m] 

{ ef_ m SE k , ie(m,n] 

where e, € R fc , 0 E R fc and ei_ m € R” _m . These formulas imply that in the definition (14.2b 
the denominator term mini<fc ||gfV|| = 1; thus 

(4.16) 400 = max \\qjY\\ = max \\q?Y\\. 

i>k i>m 


In view of the formula S = D 2 G 2 G\ /l, 1 , and the definition of D in (14.5L we have 


qjY = d i ef_ m G 2 G\D 1 1 E k , i E ( m,n .]. 


Therefore, for i E (m, n], ||gfy|| < 


ef_ m G 2 G\E k \\. It follows that 


max|| g fr|| < 

2>m 


max i>m dj 
minj< fe di 


max \\ejG 2 G\E k \\, 


which, together with (14. 1 6b . establishes (14. 1 5b . □ 

4.3. Main Results. We first extend the definition (14.2b for S k (X) into a more general 
form. For any matrix M of n rows, we define 


(4.17) 


r fc , m (Af) 


A 


maxj >m \\ef M\\ 
minj< fc \\ef M\\ 


By this definition, S k (X) = T k ,k(Q T X). 

It is worth observing that (i) T k ,m(M) is monotonically non-increasing with respect to 
to for fixed k and M; (ii) T k ^ m (M) is small if the first k rows of M are much larger in 
magnitude than the last n — to; (iii) if {\\ef M\\} is non-increasing, then T k ^ m (M) < 1. 

Specifically, since the eigenvalues of A are ordered in a descending order, for the matrix 
V in (14.3b we have 


(4.18) 


r k ,m{V) 


\\el + 1 V\\ 

\\e T k V\\ 


( 1 + ^m+l 4-I - ^m+1 \ 2 ^ . 

I i + aS + - + a? ) - ’ 


Evidently, the faster the decay is between X k and A m +i, the smaller is I4 m (F). 

Moreover, when M = z E R" is a vector which is in turn the element-wise multiplica¬ 
tion of two other vectors, say x E R" and y G R" so that z* = x t yi for i = 1, • • • , n, then it 
holds that 


(4.19) 


r k,m (^) ^ rr k,m (y)- 


In our first main result, we refine the estimation of 4(F) and compare it to 8 k (X). 
THEOREM 4.3. Under the conditions of Lemma \4. 1\ 


(4.20) 

Furthermore, 

(4.21) 


4(F) < r k , m (Q T X)T k , m {V) G\E k 


4(F) . ma x j>m \\qJX\ 


< 


S k (X) maxjxt \\qJX\\ 


T ki m(V) G\E k 
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Proof. Observe that the ratio in the right-hand side of ( 14.151 ) is none other than T km {d). 
Applying (14.19b to M = d where d = diag(Z?) with Da defined in <E3, Xi = \\qfX\\ and 
Vi = ||efV||, we derive T k ,m(d) < T k>m (Q T X)T k:m (y).We observe that \\e[G 2 G\E k \\ < 
||G^fc ||2 for all i £ [1 ,n — m], since the row vectors ef G 2 are all unit vectors. Substituting 
the above two inequalities into (14. 15b . we arrive at (14.20b . To derive (14.21b . we simply observe 
that 


r k , m (Q T x) = 


max i>m \\qJX\\ 
minj< fe \\qJX\\ 


= S k (X) 


max i>m \\qjX\\ 
ma,Xj >k \\qJX\\ 


Substituting the above into (14.20b and dividing both sides by S k (X), we obtain (14.21b . □ 

To put the above results into perspective, let us examine the right-hand side of (14.21b . 
Clearly, the first term, the ratio involving \\qj X ||’s, is always less than or equal to one since 
k < to, and it decreases as m increases. In particular, when m = k + 1 + pk with p > 0 
and a large k, then m k and the ratio can be tiny as long as there is a significant decay in 
{\\qj X ||}” =1 between indices k and to. In addition, from (14.18b . we know that the second 
termTfc.mlV) < 1 and can be far less than one if there is a large decay between X k and 
The third term ||G}i?fe|| 2 , however, presents a complicating factor. How this term behaves as 
p increases requires a scrutiny which will be the topic of Section [4~4l 

Similarly, we can examine the right-hand side of ( 14.20b in which only the first term 
is different. Given a good approximate basis X for which the row norms of Q T X have a 
nontrivial decay, we can also have T k ,m{Q T X) <C 1; and the faster the decay is, the smaller 
is the term T k}Tn (Q T X). Therefore, with the exception of the term || 2 , all the terms in 

the right-hand sizes of ( 14.20b and ( 14.21b are small under reasonable conditions. 

Next we consider the case where X £ R" x k is the result of applying a block power 
iteration q times to an initial random matrix Xq £ 


pnx/c 


(4.22) 


X = p(A)*X 0 = Qp(A) q Q T X 0 , 


where p{A) is a polynomial or rational matrix function accelerator (or filter) such that 
(4.23) min |p(Aj)| = |p(A fc )| > |p(A fc+ i)| > • • * > |p(A m+ i)| = max |p(Aj)|- 

1 < j</c m<j<n 

THEOREM 4.4. Let X be defined in ( 14.22b from an initial matrix Xq £ M. nxk . Assume 
that the conditions of Lemma \4. l\ hoid. Then there exists a constant c m such that 


S k (Y) < Cr, 


p(A m +i) 


p(Afc) 


(4.24) 
where 

(4.25) c m = T Krn (Q T X 0 )T Km (V) ^G\E k 
Moreover, there exists a constant c' such that 


(4.26) 
where 

(4.27) 


4 on 

S k (X) 


<cL 


P(Am+l) 


p(4+l) 




max J>ro ||gJX 0 || f 

—:- .. Ty .. 1 GjAfc 

mm j>k ||qj A 0 || 
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Proof. It follows from Q T X = p(A) q Q T X 0 that 


(4.28) 


hi x \\ = \p(^i)\ g hi x o\\, i = !,■■■ ,n. 


Applying ( 14. 19b to ( 14.28b . we obtain T k , m (Q T X ) < T k ,m(p(.A) q )T kjTn (Q T Xf) which es¬ 
tablishes (14.24b . upon substituting into (14.20b . 

To prove (14.26b . we first use (14.28b to calculate 


ma Xj >m ||gJX|| 
max i>fe \\qJX\\ 


maxj >m IpjX^WqJXpW 
max j>k |p(Aj)| 9 |kJX 0 || 


io(A m+ i) q max j>m ||gJX 0 || 
p(A fc+ i) mm j>k \\qjX 0 \\ 


Then substituting the above into (14.21b yields (14.26b . □ 

Let us also state a couple of special cases of (14.24b . 

COROLLARY 4.5. If the Gi-Assumption holds for m = k+pk, then there exist constants 
C p and Cp such that 


5 k (Y) < C p 



" and W < C' 


p(Afe) 

4(X) ^ 

p(Afc+i) 


In particular, when there is no augmentation (p = 0) and no acceleration (p(t) = t), the 
convergence rate reduces to 5 k (Y ) < Go |Afc+i/Afc| 9 . 

Finally, we remark that all of our results point out that there exists a matrix Y € E n x k 
in the augmented subspace TZ(X. P ) (which is constructed from the matrix X) that is a better 
approximate basis for 7 Z(Q k ) than X is, under reasonable conditions. It is known that the 
Ritz pairs produced by the RR procedure are optimal approximations to the eigenpairs of A 
from the input subspace (see m for example). Therefore, the derived bounds in this section 
should be attainable by the Ritz pairs generated by the ARR procedure. 

4.4. Validity of Gi -Assumption. A key condition for our results is the Gi-Assumption, 
given in (14. 1 lb . that requires the first m rows of G in (14.7b to be linearly independent. Under 
this assumption, the larger m is, the better the convergence rate could be. 

Let us examine the matrix Gi consisting of the first to rows of G in (14.7b . To simplify 
notation, we use H for Gi, redefine G as the first to row of G in (14.7b . and consider the matrix 

(4.29) H = [C A m C ••• A^G] S K mx(p+1)fc , 


where A m is the m x to leading block of A whose disgonal is assumed to be positive. 

We first give a necessary condition for the to rows of H to be linearly independent. 

Proposition 4.6. Let m e ( k , k + pk]for p > 0. The matrix H e j^mx(p+i)k d e fi nec i 
in ( 14.29b has full rank m only if A m has no more than k equal diagonal elements (i.e., A m 
contains no eigenvalue of multiplicity greater than k). 

Proof. Without loss of generality, suppose that the first k + 1 diagonal elements of A m 
are all equal, i.e., Ai = A 2 = ■ ■ • = Afc+i = a. Then the first k + 1 rows of H, say H', is of 
the form H' = \C' aC' ■ • ■ a p C'], where C' consists of the first k + 1 rows of G. Since 
all column blocks are scalar multiples of C' which has k columns, the rank of H is at most 
k. independent of to. □ 

The fact that H is built from G which has only k columns dictates that to have rank(ff) 
greater than k, it is necessary that the maximum multiplicity of A m must not exceed k. 

On the other hand, the next result says that when p = 1 and to reaches its upper bound 2k, 
a multiplicity equal to k is sufficient for H to attain the full rank 2k (i.e., to be nonsingular) 
in a generic case. 
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First, let us do the partitioning 

Ci A!Ci ' 

C 2 A 2 C 2 ’ 

where m = 2k, and Cj,Aj, j = 1,2, are all k x k submatrices. Recall that Ai consists of the 
first k eigenvalues of A and A 2 the next k eigenvalues. 

PROPOSITION 4.7. Let p = 1, m = 2k, and C, A m and H be defined as in d4.30l >. Let 
r be the maximum multiplicity of A m . Assume that any k x k submatrix ofC is nonsingular. 
Then H is nonsingular for r = k. 

Proof. We will show that when Ai or Afc+i has multiplicity k, then H is nonsingular. 
All the other cases can be similarly proven with appropriate permutations before partitioning 
(14.30b is done. 

First, the nonsingularity of H is equivalent to that of 


' Ci 

AiCx ' 

■ cr 1 



1 

Ai 


I 

Ai 

c 2 

A 2 C 2 


cr 1 . 


C 2 C~ l 

a 2 c 2 c- 1 


F 

A 2 F 


where F = C 2 C 1 1 is nonsingular by our assumption. Eliminating the (2,1) block, we obtain 

' I Ai 
F A 2 F 

Hence, the nonsingularity of // is equivalent to that of FA\ — A 2 F, or in turn equivalent to 
that of the following matrix: 

(4.31) K = Ai - F~ 1 A 2 F. 

If the multiplicity of Ai is k (implying that Ai = A&/), (14.311) reduces to K = F~ l {X k I — 
A 2 )F. On the other hand, if the multiplicity of Afc+i is k (implying that A 2 = X k +il), then 
K = Ai — Afc+i I. In either case, K is nonsingular since Afc+i < A k \ hence, so is F[. (Also 
in either case, K becomes singular for multiplicity r > k which implies A &+1 = A k .) □ 

In Proposition 14.71 we assume that every k x k submatrix of C is nonsingular. It is 
well-known that for a generic random matrix C, this assumption holds with high probability. 
Therefore, in a generic setting Proposition l4.7l holds with high probability. 

Now the unproven case is for maximum multiplicity r < k. Let us rewrite K in (14.31b 
into a sum of two matrices, 

(4.32) K = (A 1 -X k I) + F~ 1 {X k I-A 2 )F. 

The first is diagonal and positive semidefinite, and the second has positive eigenvalues when 
Afc > Afc+i, but is generally asymmetric. So far, we have not been able to find a result that 
guarantees nonsingularity for such a matrix K. However, in a generic setting where K comes 
from random matrices, nonsingularity should be expected with high probability (which has 
been empirically confirmed by our numerical experiments). 

It should be noted that G 1 being nonsingular with m = k + kp represents the best 
scenario where the acceleration potential of p-block augmentation is fully realized. However, 
m < k + kp does not represent a failure, considering the fact that as long as rn > k, an 
acceleration is still realized to some extent. 

Once it is established for p = 1 and m = 2k that in a generic setting II is nonsingular 
whenever the maximum multiplicity of A m is less than or equal to k, the same result can in 
principle be extended to the case of p = 3 by considering 

H = [C AC A 2 C A 3 C] = [[C AC] A 2 [C AC]] = [C AC], 


I Ax 
0 A 2 F — F A\ 


(4.30) 


C = 


Cl 

c 2 


A™ = 


Ai 


A 2 


H = 
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where C = [C AC] and A = A 2 , which has the same form as for the case p = 1 . It will also 
cover the case of p = 2 where the matrix involved is a submatrix of the one for p = 3. 

It is worth noting that m = (jp + l)k could be kept constant if k is decreased while 
p is increased. Is it sensible to use fewer vectors in power iterations but to compensate it 
with an augmentation of more blocks? Although in some cases this strategy works well, in 
general it seems to be a risky approach for two reasons. First, the smaller k is, the more 
likely it is to encounter matrices that have eigenvalues of multiplicity greater than k. In this 
case, by Proposition 14.61 the benefit of augmentation could become limited. Secondly, we 
have observed in numerical experiments that the condition number of G\ tends to increase as 
p increases, which would in turn increase the constants c rn and c' m in (14.20b - d4.21l ). These 
facts suggest that using a small k and a large p to compute more than k eigenpairs could be 
numerically problematic. In our implementation, we choose to be conservative by using the 
default value of p = 1 , while setting k to be slightly bigger than the number of eigenpairs to 
be computed. 

5. Polynomial Accelerators. To construct polynomial accelerators (or filters) p(t), we 
use Chebyshev interpolants on highly smooth functions. Chebyshev interpolants are polyno¬ 
mial interpolants on Chebyshev points of the second kind, defined by 

(5.1) tj = - cos(jn/N), 0 < j < N, 

where N > 1 is an integer. Obviously, this set of A’ + 1 points are in the interval [—1,1] 
inclusive of the two end-points. Through any given data values fj,j = 0,1, • • • , N, at these 
TV + 1 Chebyshev points, the resulting unique polynomial interpolant of degree N or less is 
a Chebyshev interpolant. It is known that Chebyshev interpolants are “near-best” 0 . 

Our choices of functions to be interpolated are 

(5.2) f d (t) = (fi(t)) d where /i(t) = max(0, t) 10 , 

and d is a positive integer. Obviously, f d (t) = 0 for t < 0 and fd{ 1) = 1. The power 10 is 
rather arbitrary and exchangeable with other numbers of similar magnitude without making 
notable differences. 

The functions in d5.2b are many times differentiable so that their Chebyshev interpolants 
converge relatively fast, see G3}. Interpolating such smooth functions on Chebyshev points 
helps reducing the effect of the Gibbs phenomenon and allows us to use relatively low-degree 
polynomials. 

There is a well-developed open-source Matlab package called Chebfun a for doing 
Chebyshev interpolations, among many other functionalities]]} In this work, we have used 
Chebfun to construct Chebyshev interpolants as our polynomial accelerators. Specifically, 
we interpolate the function fdit) by the d-th degree Chebyshev interpolant polynomial, say, 

(5.3) = 7i t d + 72 t d ~ X + ■■■ +"fdt + 7d+i- 

Suppose that we want to dampen the eigenvalues in an interval [a, b ], where a < A„ and 
b < A/c, while magnifying eigenvalues to the right of [a, b]. Then we map the interval [a, b] 
onto [—1,1] by an affine transformation and then apply ip d (‘) to A. That is, we apply the 
following polynomial function to A, 

(5.4) p d (t) = i/j d • 

Let Td = ( 71 , 72 , • • • , 7d+i) denote the coefficients of the polynomial xpd(t) in (15.31 ). The 
corresponding matrix operation Y = p d {A)X can be implemented by Algorithm[7]below. 


'Also see the website http : //www. chebfun . org/docs/guide/guide04 .html 
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Algorithm 7: Polynomial function: Y = POLY (/l, X, a, b. 1’^) 

1 Compute co = and c\ = Set Y = 'y±X. 

2 for j = 1,2, ..., d do Y = cqY + a AY + jj+i X. 


For a quick comparison, we plot our Chebyshev interpolates of degrees 2 to 7 and the 
Chebyshev polynomials of degrees 2 to 7 side by side in Figure 15.11 For both kinds of 
polynomials, the higher the degree is, the closer the curve is to the vertical line t = 1. We 
observe that inside the interval [—1,1], our Chebyshev interpolates have lower profiles (with 
magnitude less than or around 0.2 except near 1) than the Chebyshev polynomials which 
oscillate between ±1, while outside [—1,1] the Chebyshev polynomials grow faster. 




(b) Chebyshev polynomials of degrees 2 to 7 


FlG. 5.1. illustration of polynomial functions 


The idea of polynomial acceleration is straightforward and old, but its success is far from 
foolproof, largely due to inevitable errors in estimating intervals within which eigenvalues 
are supposed to be suppressed or promoted. The main reason for us to prefer our Chebyshev 
interpolates over the classic Chebyshev polynomials is that their lower profiles tend to make 
them less sensitive to erroneous intervals, hence easier to control. Indeed, our numerical 
comparison, albeit limited, appears to justify our choice. 

6. Details of ARRABIT Algorithms. In this section, we describe technical details and 
give parameter choices for our ARRABIT algorithm which computes k eigenpairs correspond¬ 
ing to k algebraically largest eigenvalues of a given symmetric matrix A. 

Guard vectors. When computing k eigenpairs, it is a common practice to compute a 
few extra eigenpairs to help guard against possible slow convergence. For this purpose, a 
small number of “guard vectors” are added to the iterate matrix X. In general, the more 
guard vectors are used, the less iterations are needed for convergence, but at a higher cost 
per iteration on memory and computing time. In our implementation, we set the number of 
columns in iterate matrix X to k + q, where by default q is set to 0.1k (rounded to the nearest 
integer). 

Estimation of A„ and A/, :+ , r To apply polynomial accelerators, we need to estimate the 
interval [a,b\ = [A n ,Afc+ g ] which contains unwanted eigenvalues. The smallest eigenvalue 
A n is computed by calling the the Matlab built-in solver EIGS (i.e., ARPACK jl 2| ). Given 
an initial matrix X £ ]R_ nx A'+'A whose columns are orthogonalized, an under-estimation of 
A k+q can be taken as the smallest eigenvalue of the projected matrix X T AX (which requires 
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an RR projection). As the iterations progress, more accurate estimates of A/. +(; will becomes 
available after each later ARR projection. 

Outer loop stop rule. Let (x*, /it), i = 1, 2, ■ • • , k, be computed Ritz pairs where 
xjxj = Sij. We terminate the algorithm when the following maximum relative residual 
norm becomes smaller than a prescribed tolerance tol, i.e., 

(6.1) maxres := max {res,} < tol, 


where 

( 6 . 2 ) 


res* 


II Axj - /ijXi |[ 2 
max(l, \ni\) 


The algorithm is also stopped in the following three cases: (i) if a maximum number of 
iterations, denoted by “maxit”, is reached (by default maxit = 30); or (ii) if the maximum 
relative residual norm has not been reduced after three consecutive outer iterations; or (iii) if 
most Ritz pairs have residuals considerably smaller than tol and the remaining have residuals 
slightly larger than tol; specifically, maxres < (1 + 9h/k)to\ (< 10 * tol), where h is the 
number of Ritz pairs with residuals less than 0.1 * tol. In our experiments we also monitor 
the computed partial trace Mi at the end for all solvers as a check for correctness. 

Continuation. When a high accuracy (say, tol < 10 -8 ) is requested, we use a contin¬ 
uation procedure to compute Ritz-pairs satisfying a sequence of tolerances: toll > tol2 > 

■ • • > tol, and use the computed Ritz-pairs for tolt as the starting point to compute the next 
solution for tolt+i. In our implementation, we use the update scheme 

(6.3) tol t+ i = max(10 -2 tol t , tol), 


where toll is chosen to be considerably larger than tol. A main reason for doing such a 
continuation is that our deflation procedure (see below) is tolerance-dependent. At the early 
stages of the algorithm, a stringent tolerance would delay the activation of deflation and likely 
cause missed opportunities in reducing computational costs. 

Inner loop parameters and stop rule. Both mpm and gn are tested as inner solvers 
to update X. These inner solvers are applied to the shifted matrix A — al which is suppos¬ 
edly positive semidefinite since a is a good approximation to A„ (computed by EIGS in our 
implementation). We check inner stopping criteria every maxit 2 iterations and check them 
at most maxiti times. In the present version, the default values for these two parameters are 
maxiti = 10 and maxit 2 = 5 Therefore, the maximum number of inner iterations allowed 
is maxiti x maxit 2 = 50. 

The inner loop stopping criteria are either 

(6.4) rc = rcond(X T X) < tol t or rc/rcp > 0.99, 


where tol* is the current tolerance (in a continuation sequence) and rep is the previously com¬ 
puted rcond(X). In (16.4b . we use the rcond subroutine in LAPACK (also used by Matlab) to 
estimate the reciprocal 1-norm condition number of X T X, which we find to be relatively in¬ 
expensive. The first condition in (16.4b indicates that X is about to lose (or have just lost) rank 
numerically, which implies that we achieve the goal of eliminating the unwanted eigenspace 
numerically. However, it is probable that a part of the desired eigenspace is also sacrificed, 
especially when there are clusters among the desired eigenvalues. Fortunately, this problem 
can be corrected, at a cost, in later iterations after deflation. On the other hand, the second 
condition is used to deal with the situation where the conditioning of X does not deteriorate, 
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which occurs from time to time in later iterations when there exists little or practically no 
decay in the relevant eigenvalues. 

Deflation. Since Ritz pairs normally have uneven convergence rates, a procedure of 
detecting and setting aside Ritz pairs that have “converged” is called deflation or locking, 
which is regularly used in eigensolvers because it not only reduces the problem size but also 
facilitates the convergence of the remaining pairs. In our algorithm, a Ritz pair {xi,pi) is 
considered to have “converged” with respect to a tolerance tol< if its residual (see (16.2b for 
definition) satisfies 

(6.5) resi < max(10^ 14 , tolj). 


After each ARR projection, we collect the converged Ritz vectors into a matrix Q c , and start 
the next iteration from those Ritz vectors “not yet converged”, which we continue to call 
X. Obviously, whenever Q c is nonempty X is orthogonal to Q c . Each time we check the 
stopping rule in the inner loop, we also perform a projection X = X — Q C (Q^X) to ensure 
that X stays orthogonal to Q c . In addition, the next ARR projection will also be performed 
in the orthogonal complement of 1Z(Q C ). That is, we apply an ARR projection to the matrix 
Y — QdQcY) f° r Y = [X AZ ■ ■ ■ A P X ]. At the end, we always collect and keep k + q 
leading Ritz pairs from both the “converged” and the “not yet converged” sets. 

Augmentation blocks. The default value for the number of augmentation blocks is 
p = 1, but this value may be adjusted after each ARR projection. We increase p by one 
when we find that the relevant Ritz values show a small decay and at the same time the latest 
decrease in residuals is not particularly impressive. Specifically, we set p = p + 1 if 


( 6 . 6 ) 


> 0.95 and 
Pk 


maxres 


maxresp 


> 0 . 1 , 


where maxresp is the maximum relative residual norm at the previous iteration. The values 
0.95 and 0.1 are set after some limited experimentation and by no means optimal. For k 
relatively large, since the memory demand grows significantly as p increases, we also limit 
the maximum value of p to p max = 3. 

Polynomial degree. Under normal conditions, the higher degree is used in a polynomial 
accelerator, the fewer number of iterations will be required for convergence, but at a higher 
cost per iteration. A good balance is needed. Let d and d max be the initial and the largest 
polynomial degrees, respectively. We use the default values d = 3 and d lnax = 15. Let pd.it) 
be the polynomial function defined in (15.4k After each ARR step, we adjust the degree based 
on estimated spectral information of Pd{A) computable using the current Ritz values. We 
know that the convergence rate of the inner solvers would be satisfactory if the eigenvalue 
ratio pd(\k+q)/Pd{^k) is small. Based on this consideration, we calculate 

(6.7) d = min I d £ Z : ^ d ^ k +<^ < q g\ 

Pd(p* k ) J 

and then apply the cap d max by setting 

(6.8) d = min(d, d max ) 


where pi and pl +q are a pair of Ritz values corresponding to the iteration with the smallest 
residual “maxres” defined in (16.11) (therefore the most accurate so far). The value of 0.9 is of 
course adjustable. 

Finally, a pseudocode for our ARRABIT algorithm with all the above features is presented 
as Algorithm [ 8 ] This is the version used to produce the numerical results of this paper. As 
one can see, ARRABIT algorithm uses A only in matrix multiplications. 
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Algorithm 8: Algorithm ARRABIT (detailed version) 


1 Input A £ R nxn , integer k £ (0, n) and tolerance tol > 0. 

2 Choose d and d max , the initial and maximum polynomial degrees. 

/* initialize */ 

Choose p and p max , the initial and maximum number of augmentation blocks. 
Choose q > 0, the number of guard vectors, so that [p + l)(k + q) < n. 

Set tolerance parameters: t = 1, tol t > tol and told = max(10 -14 , tol 2 ). 
Initialize converged Ritz pairs (Q c , E c ) = 0 for deflation purposes. 

Initialize an i.i.d. Gaussian random matrix X £ R nx ( k + < i\ 

8 Estimate the interval [A ra , Afc+ g ] w [a, &]. 

9 for y = 1...., maxit do 


10 

n 

12 

13 

14 

15 


Initialize rc to infinity, 
for i i = 1 , 2 , • • • , maxiti, do 

for i 2 = 1 , 2 , ■ • ■ , maxit 2 , do /1 

if MPM is the inner solver, then 

Call X = poly(A - al, X, 0, b - a, T d ). 
Normalize the columns of X individually. 


/* outer loop 

/* inner loop 
call inner solvers 
/* MPM 
/* accelerator 


16 

17 

18 

19 

20 

21 

22 


if GN is the inner solver, then / * GN 

Compute Y = X (X T X) \ 

Call Z = POLY(A — al, Y, 0, b — a, 1^). /* accelerator 

_ Compute X = Z- X(Y T Z - I)/ 2. 

Compute X = X — Q C (Q^X) if Q c ^ 0. /* projection 

Set rep = rc and compute rc = rcond(X T X). 

if the inner stop rule (16.4b is met, then break.: /* end inner loop 


23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 


Compute Y = [ X , AX, ..., A P X], / * augmentation 

Y = Y — QciQ^Y) if Q c ^ 0. /* projection 

Perform ARR step: ( X, E) = RR(A, Y). / * ARR 

Extract k + q leading Ritz pairs (xi, pi) from ( Q c , E c ) and ( X, E). 

Overwrite (X , E) by the k + q Ritz pairs. Compute residuals by (16.2k 
if the outer stop rule (16.1b is met for tol, then 

output the Ritz pairs [X, E) and exit. /* output and exit 

if the outer stop rule (16.1b is met for fob then /* continuation 

L Set tolt+i = max (l0 _2 tolt,tol), b = pk+q and t = t + 1. 

Collect converged Ritz pairs in ( Q c , E c ) that satisfy (16.5b . /* deflation 
Overwrite (X, E) by the remaining not yet converged Ritz pairs, 
if rules in (16.6b are met, then set p = min(p + 1 , p m ax)l / * update p 
Update the polynomial degree by rules (16.7b - (16.8b . /* update degree 


* / 

* / 
* / 
★ / 
* / 


* / 
* / 


* / 

★ / 

* / 
* / 
* / 


* / 
* / 

* / 

* / 
* / 


7. Numerical Results. In this section, we evaluate the performance of arrabit on a 
set of sixteen sparse matrixes. Although we have constructed the algorithm with parallel 
scalability in mind as a major motivating factor, a study of scalability issues in a massively 
parallel environment is beyond the scope of the current paper. 

As a first step, we test the algorithm in Matlab environment, on a single computing 
node (2 processors) and without explicit code parallelization, to determine how it performs 
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in comparison to established solvers. We have implemented our ARRABIT algorithm, as 
is described by the pseudocode Algorithm [8] in MATLAB. For brevity, the two variants, 
corresponding to the two choices of inner solvers, will be called MPM and GN, respectively. 

We test two levels of accuracy in our experiments: tol = 10 -6 or tol = 10 -12 . By our 
stoping rule, upon successful termination the largest eigenpair residual will not exceed 10 -5 
or 10 -11 , respectively. Since our algorithm checks the termination rule only after each ARR 
call, it often returns solutions of higher accuracies than what is prescribed by the tol value. 

7.1. Solvers, Platform and Test Matrices. Since it is impractical to carry out numer¬ 
ical experiments with a large number of solvers, we have carefully chosen two high-quality 
packages to compare with our ARRABIT code. One package is ARPACK0 lfT2l , which is be¬ 
hind the Matlab built-in iterative eigensolver EIGS, and will naturally serve as the benchmark 
solver. Another is a more recent package called FEAST j28l which has been integrated into 
Intel’s Math Kernel Library (MKL) under the name “Intel MKL Extended Eigensolver’0. 
Both ARPACK and FEAST are written in Fortran. While ARPACK can be directly accessed 
through EIGS in Matlab, we call FEAST from Intel’s MKL Library via Matlab’s MEX exter¬ 
nal interfaces. In our experiments, all parameters in EIGS and FEAST are set to their default 
values, and each solver terminates with its own stopping rules using either tol = 10” 6 or 
tol = 10 -12 . 

We have also examined a few other solvers as potential candidates but decided not to 
use them in this paper, including but not limited to the filtered Lanczos algorithrr@ [[6] and 
the Chebyshev-Davidson algorithnf] | |29| . Our initial tests indicated that, for various reasons, 
these solvers’ overall performance could not measure up with that of the commercial-grade 
software packages ARPACK and FEAST on a number of test problems. This fact may be more 
of a reflection on the current status of software development for these solvers than on the 
merits of the algorithms behind. 

It is important to note that FEAST is designed to compute all eigenvalues (and their eigen¬ 
vectors) in an interval, which is given as an input along with an estimated number of eigen¬ 
values inside the interval. When computing k largest eigenpairs, we have observed that the 
performance of FEAST is affected greatly by the quality of the two estimations: the interval 
itself and the number of eigenvalues inside the interval. When calling FEAST, we set (i) the 
interval to be [Aj£, Aj] where X* k and Aj are computed eigenvalues by EIGS using the same 
tolerance tol; and (ii) the estimated number of eigenvalues in the interval to 1.2 k rounded to 
the nearest integer. We consider this setting to be fair, if not overly favorable, to FEAST. 

Our numerical experiments are preformed on a single computing node of Ed iso if], a Cray 
XC30 supercomputer maintained at the National Energy Research Scientific Computer Cen¬ 
ter (NERSC) in Berkeley. The node consists of two twelve-core Intel “Ivy Bridge” processors 
at 2.4 GHz with a total of 64 GB shared memory. Each core has its own LI and L2 caches 
of 64 KB and 256 KB, respectively; A 30-MB L3 cache shared between 12 cores on the “Ivy 
Bridge” processor. We generate Matlab standalone executable programs and submit them as 
batch jobs to Edison. The reported runtimes are wall-clock times. 

On a multi/many-core computer, memory access patterns and communication overheads 
have a notable impact on computing time. In Matlab, dense linear algebra operations are 
generally well optimized by using BLAS and LAPACK tuned to the CPU processors in use. 
On the other hand, we have observed that some sparse linear algebra operations in Matlab 


’See http://www.caam.rice.edu/software/ARPACK/ 

3 See http \]/ software . intel. com/en-us/intel-mkl (version 11.0.2 on our workstation) 
4 See http://www-users.cs.umn.edu/~saad/software/fiItlan 
5 See http://faculty.smu.edu/yzhou/code.htm 

6 See http://www.nersc.gov/users/computational-systems/edison/ 



Block algorithms with an ARR procedure for large-scale exterior eigenpair computation 


21 


seem to have not been as highly optimized (at least in version 2013b). In particular, when 
doing multiplications between a large sparse matrix and a dense matrix (like AX), Matlab 
is often slower than a routine in Intel’s Math Kernel Library (MKL) named “mkl_dcscmm” 
when it is invoked through Matlab’s MEX external interfaces in our experiments. For this 
reason, we use this MKL routine in our Matlab code to perform the operation AX. 

Our test matrices are selected from the University of Florida Sparse Matrix CollectiorQ. 
For each matrix, we compute both k eigenpairs corresponding to k largest eigenvalues and 
those corresponding to k smallest eigenvalues. Many of the selected matrices are produced by 
PARSEC GO, a real space density functional theory (DFT) based code for electronic structure 
calculation in which the Hamiltonian is discretized by a finite difference method. We do not 
take into account any background information for these matrices; instead, we simply treat 
them algebraically as matrices. 

Table l7.ll lists. for each matrix A, the dimensionality n, the number of nonzeros nnz(A) 
and the density of A, i.e., the ratio (nnz(A)/n 2 ) 100%. The number of eigenpairs to be com¬ 
puted is set either to 1% of n rounded to the nearest integer or to k = 1000 whichever is 
smaller. Table l7.ll also reports the number of the nonzeros in the Cholesky factor L of matrix 
A — al where a = iriax(2A n (A). 0). The factorization is carried out after an “approxi¬ 
mate minimum degree” permutation performed by the Matlab function “amd”, as is done by 
the following MATLAB line: t = amd(B); L = chol(B(t, t)/ lower'). We have also tested 
the “symmetric approximate minimum degree” permutation (“symamd” in Matlab), but the 
corresponding density of L is slightly larger on most matrices. The density of factor L and 
the computing time in seconds used by Cholesky factorization are also given in Table 17.11 
Although all matrices A are very sparse, the Cholesky factors of some matrices, such as 
Gal0Asl0H30, Ga3As3H12 and Ge87H76, are quite dense. As a result, the Cholesky factor¬ 
ization time varies greatly from matrix to matrix. We mention that the spectral distributions 
of the test matrices can behave quite differently from matrix to matrix. Even for the same 
matrix, the spectrum of a matrix can change behavior drastically from region to region. Most 
notably, computing k smallest eigenpairs of many matrices in this set turns out to be more 
difficult than computing k largest ones. 

The largest matrix size in this set is more than a quarter of million. Relative to the com¬ 
puting resources in use, we consider these selected matrices to be fairly large scale. Overall, 
we consider this test set reasonably diverse and representative, fully aware that there always 
exist instances out there that are more challenging to one solver or another. 

7.2. Comparison between RR and ARR. We first evaluate the performance difference 
between ARR and RR for both MPM and GN. Table 17721 gives results for computing both 
k largest and smallest eigenpairs on the first six matrices in Table 17.11 to the accuracy of 
tol = 10 -12 . We note that RR and ARR correspond to p = 0 and p > 0, respectively, in 
Algorithm^ In order to differentiate the effect of changing p from that of changing the poly¬ 
nomial degree, we also test a variant of Algorithm[8]with a fixed polynomial degree at <7 = 8 
(by skipping line 34). In Table [731 “maxres” denotes the maximum relative residual norm 
in (16.11) . “time” is the runtime measured in seconds, “RR” is the total number of the outer 
iterations, i.e., the total number of the RR or ARR calls made (excluding the one called in 
preprocessing for estimating A k+q), and “p” and “d” are the number of augmentation blocks 
and the polynomial degree, respectively, used at the final outer iteration. In addition, on the 
matrices cfdl and finance we plot the (outer) iteration history of maxres in Figures ITT! and 
I7.2l for computing k largest and smallest eigenpairs, respectively. 

The following observations can be drawn from the table and figures. 


7 See http : //www.cise.uf1.edu/research/sparse/matrices 
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Table 7.1 

Information of Test Matrices 


matrix name 

n 

k 

nnz(A) 

density of A 

nnz(L) 

density of L 

time 

Andrews 

60000 

600 

760154 

0.021% 

117039940 

6.502% 

7.18 

C60 

17576 

176 

407204 

0.132% 

34144169 

22.105% 

1.62 

cfdl 

70656 

707 

1825580 

0.037% 

35877440 

1.437% 

1.81 

finance 

74752 

748 

596992 

0.011% 

2837714 

0.102% 

0.28 

Gal0Asl0H30 

113081 

1000 

6115633 

0.048% 

1562547805 

24.439% 

127.12 

Ga3As3H12 

61349 

613 

5970947 

0.159% 

596645077 

31.705% 

42.00 

shallow .water Is 

81920 

819 

327680 

0.005% 

2357535 

0.070% 

0.21 

SH0H16 

non 

171 

875923 

0.300% 

56103003 

38.474% 

2.60 

Si5H12 

19896 

199 

738598 

0.187% 

78918573 

39.871% 

3.80 

SiO 

33401 

334 

1317655 

0.118% 

186085449 

33.359% 

10.01 

wathenlOO 

30401 

304 

471601 

0.051% 

1490209 

0.322% 

0.32 

Ge87H76 

112985 

1000 

7892195 

0.062% 

1403571238 

21.990% 

109.64 

Ge99H 100 

112985 

1000 

8451395 

0.066% 

1477089634 

23.141% 

120.08 

Si41Ge41H72 

185639 

1000 

15011265 

0.044% 

3457063398 

20.063% 

358.53 

Si87H76 

240369 

1000 

10661631 

0.018% 

5568995364 

19.277% 

1499.80 

Ga41As41H72 

268096 

1000 

18488476 

0.026% 

6998257446 

19.473% 

2498.43 


• The performances of MPM and GN are similar. For both of them, ARR can accel¬ 
erate convergence, reduce the number of outer iterations needed, and improve the 
accuracy, often to a great extent. 

• The scheme of adaptive polynomial degree generally works better than a fixed poly¬ 
nomial degree. A more detailed look at the effect of polynomial degrees is presented 
in Section 17731 

• The default value p = 1 for the number of augmentation blocks in ARR is generally 
kept unchanged (recall that it can be increased by the algorithm). 

• The total number of ARR called is mostly very small, especially in the cases where 
the adaptive polynomial degree scheme is used and the k largest eigenpairs are com¬ 
puted (which tend to be easier than the k smallest ones). We observe from Figure 
17. II that in several cases a single ARR is sufficient to reach the accuracy of tol=le-6 
(even of tol=le-12 in one case). 

7.3. Comparison on Polynomials. We next examine the effect of polynomial degrees 
on the convergence behavior of MPM and GN, again on the first six matrices in Table 17.11 
We compare two schemes: the first is to use a fix degree among {4, 8,15} and skip line 34 of 
Algorithmic] and the second is the adaptive scheme in Algorithm^] The computational results 
are summarized in Table 17731 We also plot the iteration history of maxres, for computing 
both k largest and smallest eigenpairs on the matrices cfdl and finance in Figures l773l and l774l 
respectively. The numerical results lead to the following observations: 

• Again the performances of MPM and GN are similar, and the default value p = 1 for 
augmentation is mostly unchanged. 

• In general, the number of outer iterations is decreased as the polynomial degree is 
increased, but the runtime time is not necessarily reduced because of the extra cost 
in using higher-degree polynomials. Overall, our adaptive strategy seems to have 
achieved a reasonable balance. 

• With fixed polynomial degrees, in a small number of test case MPM and GN fail to 
reach the required accuracy. 

Finally, we compare the performance of Algorithm [8] either using Chebyshev interpo¬ 
lates defined in (15.31) or the Chebyshev polynomials defined in (12.61) on the first six matrices 
in Table l7.l1 The comparison results are given in Table 17.41 Even though both types of poly- 
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Table 7.2 

Comparison results between RR and ARR with tol=le-12 


MPM with RR MPM with ARR GN with RR GN with ARR 


matrix 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

computing k largest eigpair by fix deg = 8 

Andrew. 

9.5e-13 

191 

4 

1/8 

1.9e-06 

250 

9 

3/8 

9.0e-12 

174 

6 

1/8 

9.9e-13 

104 

2 

1/8 

C60 4.0e-12 

45 

11 

3/8 

6.3e-12 

12 

3 

1/8 

7.5e-12 

44 

22 

3/8 

1.4e-12 

16 

5 

1/8 

cfdl 

9.8e-13 

381 

4 

1/8 

1.0e-12 

296 

4 

1/8 

9.8e-13 

294 

4 

1/8 

9.9e-13 

206 

2 

1/8 

financ. 

9.9e-13 

157 

3 

1/8 

8.9e-13 

151 

3 

1/8 

1.0e-12 

196 

4 

1/8 

1.0e-12 

141 

2 

1/8 

GalOAs. 

3.5e-13 

1218 

22 

3/8 

9.9e-13 

1483 

8 

2/8 

6.1e-12 

910 

8 

1/8 

9.9e-13 

448 

3 

1/8 

Ga3As3. 

9.7e-13 

467 

6 

1/8 

9.8e-13 

270 

5 

1/8 

1.9e-12 

307 

8 

1/8 

9.4e-13 

179 

3 

1/8 

computing k largest eigpair with adaptive polynomial degree 

Andrew. 

2.0e-l1 

337 

9 

3/5 

8.8e-13 

148 

5 

2/5 

5.3e-12 

319 

17 

3/5 

1.0e-12 

125 

4 

1/5 

C60 

8.7e-12 

41 

10 

3/9 

2.0e-12 

13 

3 

1/9 

4.2e-12 

42 

20 

3/9 

5.5e-12 

13 

3 

1/9 

cfdl 

1.3e-12 

441 

5 

1/3 

9.8e-13 

190 

4 

1/3 

4.1e-12 

482 

17 

3/3 

9.9e-13 

188 

3 

1/3 

financ. 

9.9e-13 

256 

4 

1/3 

1.3e-12 

97 

3 

2/3 

2.7e-12 

380 

14 

3/3 

l.le-12 

69 

1 

1/3 

GalOAs. 

4.7e-12 

1199 

6 

1/5 

9.6e-13 

442 

4 

1/5 

7.1e-12 

1442 

19 

3/5 

9.7e-13 

580 

4 

1/6 

Ga3As3. 

2.9e-12 

473 

7 

2/5 

1.7e-12 

169 

4 

1/5 

3.9e-12 

494 

17 

3/5 

1.7e-12 

198 

4 

1/5 






computing k smallest eigpair by fix deg = 5 

3 






Andrew. 

4.2e-12 

465 

7 

2/8 

1.5e-13 

219 

6 

2/8 

7.2e-12 

475 

19 

3/8 

1.0e-12 

199 

5 

1/8 

C60 

1.7e-12 

30 

9 

3/8 

6.8e-13 

17 

6 

1/8 

5.5e-12 

24 

13 

3/8 

6.7e-12 

13 

4 

1/8 

cfdl 

4.1e-05 

2870 

30 

3/8 

6.0e-12 

1543 

21 

3/8 

1,5e-04 

2505 

30 

3/8 

7.9e-12 

1394 

22 

3/8 

financ. 

3.8e-08 

1759 

30 

3/8 

5.le-13 

700 

9 

3/8 

3.5e-06 

1651 

30 

3/8 

7.2e-13 

713 

11 

3/8 

GalOAs. 

8.6e-10 2642 

10 

3/8 

3.7e-12 

1372 

5 

1/8 

2.1e-02 

1436 

6 

1/8 

2.6e-12 

961 

4 

1/8 

Ga3As3. 

7.2e-12 

964 

11 

3/8 

2.7e-12 

489 

4 

1/8 

4.2e-12 

994 

24 

3/8 

9.9e-13 

381 

4 

1/8 

computing k smallest eigpair with adaptive polynomial degree 

Andrew. 

7.3e-12 

466 

8 

3/8 

9.7e-13 

200 

4 

1/8 

8.9e-12 

505 

21 

3/8 

l.le-12 

185 

5 

1/8 

C60 6.7e-12 

38 

9 

3/7 

2.8e-12 

26 

9 

3/6 

4.0e-12 

31 

23 

3/6 

9.2e-13 

15 

8 

2/6 

cfdl 

3.7e-08 

2869 

30 

3/15 

8.9e-12 

719 

4 

1/15 

2.3e-06 

2515 

30 

3/15 

4.2e-12 

1017 

12 

3/15 

financ. 

3.7e-12 

1391 

9 

3/15 

1.4e-12 

600 

6 

1/15 

5.3e-12 

1416 

24 

3/15 

3.4e-12 

467 

5 

1/15 

GalOAs. 

4.5e-ll 

3261 

12 

3/8 

l.le-12 

1558 

6 

1/8 

2.9e-12 

3681 

24 

3/8 

4.0e-12 

963 

3 

1/9 

Ga3As3. 

5.9e-12 

1046 

8 

3/9 

9.9e-13 

420 

4 

1/9 

7.7e-12 

1238 

24 

3/9 

9.5e-13 

338 

5 

1/9 


nomials work well on these six problems, some performance differences are still observable 
in favor of our polynomials. 

7.4. Comparison with ARPACK and FEAST. We now compare MPM and GN with 
EIGS and FEAST for computing both k largest and smallest eigenpairs for all sixteen test ma¬ 
trices presented in Tables 17711 (which also lists the k values). Computational results are sum¬ 
marized in Tables 17751 and !7.6l where “SpMV” denotes the total number of SpMVs, counting 
each operation AX € R nxk as k SpMVs. 

In addition, the speedup with respect to the benchmark time of EIGS is measured by the 
quantity log 2 (time E1GS /time), as shown in Figures l7T5l and l7T6l where a positive bar represents 
a “speedup” and a negative one a “slowdown”. In these two figures, matrices are ordered from 
left to right in ascending order of the solution time used by EIGS; that is, when moving from 
the left towards the right, problems become progressively more and more time-consuming for 
EIGS to solve. A quick glance at the figures tells us that MPM and GN provide clear speedups 
over EIGS on most problems, especially on the more time-consuming problems towards the 
right. For example, MPM and GN deliver a speedup of about 4 times on each of the seven 
most time-consuming problems in Figure [731 ah and a speedup of about 10 times on the most 
time-consuming problem Ga41As41H72 in Figure I776l a). On the other hand, compared to 
EIGS, feast’s timing profile looks volatile with both big “speedups” and “slowdowns”. 

The benchmark solver EIGS usually, though not always, returns solutions more accurate 
than what is requested by the tolerance value. In particular, for tol = 10~ 6 the accuracy of 
EIGS solutions often reach the order of 0( 10 -12 ). This is due to the fact that EIGS need to 
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FIG. 7.1. ARR vs RR: Iteration history of maxres for computing k largest eigenpairs 




FIG. 7.2. ARR vs RR: Iteration history of maxres for computing k smallest eigenpairs 


maintain a high working accuracy to ensure proper convergence. 

As is observed previously, it is often more time-consuming for EIGS, MPM and GN to 
compute k smallest eigenpairs than k largest ones on many test matrices. By examining 
the spectra of the matrices such as cfdl and finance, we believe that this phenomenon is 
attributable to the property that these matrices tend to have a flatter end on the left end of 
their spectra. On the other hand, the behavior of FEAST appears less affected by this property 
but more by sparsity patterns (see below). 

Concerning the performance of FEAST, we make the following observations. 

• FEAST solves most problems successfully but fails to correctly solve a few cases. 
When computing k largest eigenvalues for the matrix Gal0Asl0H30 FEAST returns 
the warning: “No eigenvalue has been found in the proposed search interval”. On 
matrix Ga3As3H12, it seems to exit normally with the output messages “Eigen- 
solvers have successfully converged”, but the subsequently computed maximum rel¬ 
ative residual norm in (16. Il l is way too large at 0.29. On matrices Ga41 As41H72 and 
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FIG. 7.4. ARR: Iteration history o/maxres for computing k smallest eigenpairs using different polynomial degrees 


Si87H76, when computing either k largest or smallest eigenpairs, FEAST terminates 
abnormally after spending a long computing time, with the message: “Eigensolvers 
ERROR: Problem from Inner Linear System Solver”. By examining the density of 
Cholesky factors for Ga41As41H72 and Si87H76 in Table 177X1 we speculate that 
the abnormal termination most likely has to do with excessive memory demands 
encountered by the inner linear system solver in Intel Math Kernel Library. 

• For tol = 10 -12 , FEAST is the fastest in solving finance and shallow .water Is for 
k largest eigenpairs, and in solving cfdl, finance, shallow.waterls and wathenlOO 
for k smallest eigenpairs. On the other hand, FEAST can be significantly slower 
than others on matrices such as Gal0Asl0H30, Ga3As3H12, Ge87H76, Ge99H100, 
Si41Ge41H72, Si87H76 and Ga41As41H72. The performance of FEAST can be at 
least partly explained from the density of Cholesky factors L shown in Table 17.11 
since FEAST uses a direct linear solver in Intel Math Kernel Library to compute 
factorizations of matrices of the form ( (f>il — A) in (12.71) . We can clearly see the 
correlation that FEAST is fast when the density of the Cholesky factor is low and 
Cholesky factorization is fast. 
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Table 7.3 

Comparison results of different polynomial degrees on tol=le-12 


matrix 


deg=4 




deg=8 




deg=15 


adaptive deg 


maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

MPM for k largest eigpair 

Andrew. 

l.le-12 

127 

5 

2/4 

1.9e-06 

250 

9 

3/8 

4.3e-12 

165 

4 

1/15 

8.8e-13 

148 

5 

2/5 

C60 

1.6e-12 

18 

6 

3/4 

6.3e-12 

12 

3 

1/8 

9.7e-13 

24 

3 

2/15 

2.0e-12 

13 

3 

1/9 

cfdl 

1.8e-12 

206 

3 

1/4 

1.0e-12 

296 

4 

1/8 

2.8e-12 

411 

5 

2/15 

9.8e-13 

190 

4 

1/3 

financ. 

9.9e-13 

102 

3 

1/4 

8.9e-13 

151 

3 

1/8 

9.0e-13 

175 

4 

1/15 

1.3e-12 

97 

3 

2/3 

GalOAs. 

1.3e-12 

906 

8 

2/4 

9.9e-13 

1483 

8 

2/8 

2.8e-01 

5908 

6 

1/15 

9.6e-13 

442 

4 

1/5 

Ga3As3. 

7.6e-13 

377 

7 

1/4 

9.8e-13 

270 

5 

1/8 

2.8e-01 

1483 

6 

1/15 

1.7e-12 

169 

4 

1/5 

SLRP for k largest eigpair 

Andrew. 

1.5e-12 

116 

4 

1/4 

9.9e-13 

104 

2 

1/8 

1.2e-13 

187 

2 

1/15 

1.0e-12 

125 

4 

1/5 

C60 

1.5e-12 

24 

9 

3/4 

1.4e-12 

16 

5 

1/8 

7.1e-13 

19 

3 

1/15 

5.5e-12 

13 

3 

1/9 

cfdl 

9.6e-13 

185 

2 

1/4 

9.9e-13 

206 

2 

1/8 

1.7e-13 

324 

2 

1/15 

9.9e-13 

188 

3 

1/3 

financ. 

1.2e-12 

77 

1 

1/4 

1.0e-12 

141 

2 

1/8 

2.7e-13 

327 

2 

1/15 

l.le-12 

69 

1 

1/3 

GalOAs. 

5.9e-13 

734 

7 

2/4 

9.9e-13 

448 

3 

1/8 

2.9e-01 

1122 

6 

1/15 

9.7e-13 

580 

4 

1/6 

Ga3As3. 

8.4e-12 

205 

4 

1/4 

9.4e-13 

179 

3 

1/8 

6.4e-02 

442 

6 

1/15 

1.7e-12 

198 

4 

1/5 

MPM for k smallest eigpair 

Andrew. 

4.1e-13 

247 

9 

3/4 

1.5e-13 

219 

6 

2/8 

9.9e-13 

448 

5 

1/15 

9.7e-13 

200 

4 

1/8 

C60 

1,6e-07 

20 

7 

3/4 

6.8e-13 

17 

6 

1/8 

7.9e-13 

26 

5 

1/15 

2.8e-12 

26 

9 

3/6 

cfdl 

2.5e-07 

1626 

30 

3/4 

6.0e-12 

1543 

21 

3/8 

4.3e-12 

1340 

9 

3/15 

8.9e-12 

719 

4 

1/15 

financ. 

6.9e-12 

1002 

21 

3/4 

5.le-13 

700 

9 

3/8 

1.0e-12 

586 

5 

1/15 

1.4e-12 

600 

6 

1/15 

GalOAs. 

9.4e-12 

1893 

15 

3/4 

3.7e-12 

1372 

5 

1/8 

1.8e-06 

2198 

6 

2/15 

l.le-12 

1558 

6 

1/8 

Ga3As3. 

4.9e-12 

569 

11 

3/4 

2.7e-12 

489 

4 

1/8 

9.7e-13 

471 

4 

1/15 

9.9e-13 

420 

4 

1/9 

SLRP for k smallest eigpair 

Andrew. 

4.6e-12 

315 

10 

3/4 

1.0e-12 

199 

5 

1/8 

9.9e-13 

208 

3 

1/15 

l.le-12 

185 

5 

1/8 

C60 

1.2e-12 

16 

9 

2/4 

6.7e-12 

13 

4 

1/8 

4.1e-13 

16 

3 

1/15 

9.2e-13 

15 

8 

2/6 

cfdl 

9.1e-07 

1956 

30 

3/4 

7.9e-12 

1394 

22 

3/8 

5.2e-12 

1121 

12 

3/15 

4.2e-12 

1017 

12 3/15 

financ. 

7.4e-12 

1223 

22 

3/4 

7.2e-13 

713 

11 

3/8 

1.6e-12 

535 

6 

1/15 

3.4e-12 

467 

5 

1/15 

GalOAs. 

1.6e-12 

1625 

8 

3/4 

2.6e-12 

961 

4 

1/8 

1.0e-12 

999 

3 

1/15 

4.0e-12 

963 

3 

1/9 

Ga3As3. 

4.8e-12 

532 

10 

3/4 

9.9e-13 

381 

4 

1/8 

9.8e-13 

374 

3 

1/15 

9.5e-13 

338 

5 

1/9 


With regard to the performance of MPM and GN, we make the following observations. 

• MPM and GN both attain the required accuracy on all test problems, and they often 
return smaller residual errors than what is required by tol. Generally speaking, the 
two variants perform quite similarly in terms of both accuracy and timing. 

• MPM and GN maintain a clear speed advantage over FEAST in most tested cases. 
They are faster than FEAST when either factorizations of shifted A are expensive, or 
when spectral distributions have a favorable decay (for example, on cfdl for com¬ 
puting k largest eigenpairs). 

• MPM and GN also maintain an overall speed advantage over EIGS, especially on 
those problems more time-consuming for EIGS (towards the right end of Figures 
1731 and | 73 . They are faster in spite of taking considerably more matrix-vector 
multiplications than EIGS, as can be seen from Tables 17.51 and 17.61 thanks to the 
benefits of relying on high-concurrency operations on many-core computers. 

• MPM and GN generally require a smaller number ARR calls, often only two or three 
when computing k largest eigenpairs. In quite a number of cases (for example, on 
finance and wathenlOO for MPM and so on), only a single ARR projection is taken 
which is absolutely optimal in order to extract approximate eigenpairs. 

• The number of augmentation blocks used by MPM and GN is usually 1, and the final 
polynomial degree never reaches the maximum degree 15 except on cfdl, finance 
and wathenlOO when computing k smallest eigenpairs. 

In Figure [ 777 ] we plot runtimes of three categories: SpMV (i.e., AX), SU (lines 10 to 22 
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Table 7.4 

Comparison results on Chebyshev interpolates in ED and Chebyshev polynomials in ED 


name 


MPM 



MPM, Cheb. poly. 


GN 



GN, Cheb. poly. 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 

maxres 

time 

RR 

p/d 






computing k largest eigpair, tol= 

le-6 







Andrew. 

2.6e-8 

58 

2 

1/5 

l.le-6 

60 

2 

1/3 

3.0e-8 

92 

2 

1/5 

6.0e-7 

89 

2 

1/3 

C60 

l.le-9 

13 

2 

1/9 

3.9e-8 

9 

2 

1/5 

5.2e-7 

9 

2 

1/8 

8.8e-6 

12 

3 

1/5 

cfdl 

5.6e-9 

155 

2 

1/3 

7.4e-7 

144 

2 

1/2 

1.5e-7 

143 

1 

1/3 

1.5e-7 

146 

1 

1/3 

financ. 

1.6e-6 

37 

1 

1/3 

1.5e-10 

51 

1 

1/3 

l.le-12 

67 

1 

1/3 

1.2e-10 

68 

1 

1/3 

GalOAs. 

5.7e-8 

264 

2 

1/5 

4.6e-8 

550 

4 

1/2 

9.2e-7 

380 

2 

1/5 

2.0e-7 

484 

3 

1/3 

Ga3As3. 

6.4e-8 

101 

2 

1/5 

1.6e-6 

112 

3 

1/3 

5.3e-7 

136 

2 

1/5 

6.2e-6 

125 

2 

1/3 

computing k largest eigpair, tol=le-12 

Andrew. 

8.8e-13 

148 

5 

2/5 

2.9e-12 

199 

7 

2/ 10 

1.0e-12 

125 

4 

1/5 

1.5e-12 

160 

7 

3/3 

C60 

2.0e-12 

13 

3 

1/9 

4.6e-12 

16 

6 

3/5 

5.5e-12 

13 

3 

1/9 

4.5e-12 

23 

12 

3/5 

cfdl 

9.8e-13 

190 

4 

1/3 

3.3e-13 

230 

6 

2/2 

9.9e-13 

188 

3 

1/3 

1.8e-12 

215 

4 

1/2 

financ. 

1.3e-12 

97 

3 

2/3 

6.8e-12 

87 

3 

1/2 

l.le-12 

69 

1 

1/3 

9.9e-13 

93 

2 

1/2 

GalOAs. 

9.6e-13 

442 

4 

1/5 

9.0e-12 

643 

9 

3/3 

9.7e-13 

580 

4 

1/6 

1.3e-12 

807 

9 

3/3 

Ga3As3. 

1.7e-12 

169 

4 

1/5 

2.2e-12 

239 

9 

3/3 

1.7e-12 

198 

4 

1/5 

4.7e-13 

285 

9 

3/3 






computing k smallest eigpair, tol= 

=le-6 







Andrew. 

4.2e-7 

113 

2 

1/8 

6. le-7 

122 

3 

1/5 

5.2e-9 

168 

3 

1/8 

2.6e-6 

175 

4 

1/5 

C60 

9.6e-7 

16 

4 

2/6 

1.3e-6 

11 

3 

1/4 

2.4e-6 

9 

3 

1/3 

1.4e-6 

10 

4 

1/4 

cfdl 

3.4e-7 

601 

2 

1/15 

5.0e-6 

427 

2 

1/ 15 

4.8e-6 

614 

5 

2/ 15 

2.7e-6 

607 

5 

2/ 15 

financ. 

1.7e-6 

338 

2 

1/ 15 

3.2e-6 

310 

2 

1/ 10 

5.3e-9 

379 

3 

1/15 

9.3e-7 

333 

3 

1/ 10 

GalOAs. 

6.2e-6 

751 

2 

1/8 

2.9e-6 

744 

3 

1/5 

1.8e-6 

715 

2 

1/7 

2.8e-6 

907 

3 

1/5 

Ga3As3. 

6.9e-6 

325 

2 

1/9 

4.2e-7 

269 

2 

1/5 

1.7e-9 

282 

3 

1/9 

1.6e-6 

369 

5 

2/5 






computing k smallest eigpair, tol= 

le-12 







Andrew. 

9.7e-13 

200 

4 

1/8 

7.3e-12 

243 

8 

3/5 

l.le-12 

185 

5 

1/8 

7.0e-12 

293 

11 

3/5 

C60 

2.8e-12 

26 

9 

3/6 

3.4e-12 

23 

9 

3/4 

9.2e-13 

15 

8 

2/6 

2.le-12 

18 

11 

3/4 

cfdl 

8.9e-12 

719 

4 

1/ 15 

8.7e-12 

1033 

11 

3/ 15 

4.2e-12 

1017 

12 

3/ 15 

9.0e-12 

1471 

23 

3/15 

financ. 

1.4e-12 

600 

6 

1/ 15 

8.6e-12 

587 

8 

3/ 10 

3.4e-12 

467 

5 

1/15 

9.0e-12 

637 

10 

3/10 

GalOAs. 

l.le-12 

1558 

6 

1/8 

7.6e-8 

1629 

9 

3/ 15 

4.0e-12 

963 

3 

1/9 

5.2e-12 

1496 

9 

3/5 

Ga3As3. 

9.9e-13 

420 

4 

1/9 

2.0e-12 

547 

9 

3/5 

9.5e-13 

338 

5 

1/9 

3.6e-12 

573 

14 

3/5 


of Algorithm [8} and ARR (lines 23 to 27 of Algorithm [8}. In particular, SpMVs are called 
in both SU and ARR, but overwhelmingly in the former. These are the major computational 
components of MPM and GN. The runtime of each category is measured in the percentage of 
wall-clock time spent in that category over the total wall-clock time. We can see, especially 
from the time-consuming problems on the right, that (i) the time of SU dominates that of RR, 
and (ii) the time of SpMVs, always done in batch of k+q, dominates the entire computation in 
almost all cases. These trends are much more pronounced (a) for MPM than for GN (recall that 
GN requires to solve k x k linear systems); and (b) for computing k smallest eigenpairs than 
for computing k largest ones (recall that the former is generally more difficult). These runtime 
profiles are favorable to parallel scalability since AX operations possess high concurrency for 
relatively large k. 

In the final set of experiments, we examine the solvers’ scalability with respect to k. We 
apply the solvers to matrices cfdl and Ge87H76, with tol = 10 -12 , and vary k from 100, 200 
up to 1200 with increment 200 (there are exceptions for FEAST). The resulting solution times 
are plotted in Figures 17781 and [7791 In both figures, the slopes of the time curves confirm that 
the three block algorithms, FEAST, MPM and GN, clearly scale better with respect to k than the 
Krylov subspace algorithm EIGS. Although EIGS can be the fastest for k small, its solution 
time increases at a faster pace than the block methods as k increases. 

Among the block algorithms themselves, all three provide comparable performances on 
cfdl when computing the k largest eigenpairs, while FEAST is the fastest when computing k 
smallest eigenpairs. On Ge87H76, which has a rather dense Cholesky factor, FEAST is much 
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FIG. 7.6. Speedup to EIGS: log 2 (time E i G s/time) on computing k smallest eigenpairs 


slower in all runs up to k = 1000 (runs for k > 1000 are skipped to save time). 

8. Concluding Remarks. The goal of this paper is to construct a block algorithm of 
high scalability suitable for computing relatively large numbers of exterior eigenpairs for 
really large-scale matrices on modern computers. Our strategy is simple: to reduce as much 
as possible the number of RR calls (Rayleigh-Ritz projections) or, in other words, to shift as 
much as possible computation burdens to SU (subspace update) steps. This strategy is based 
on the following considerations. RR steps perform small dense eigenvalue decompositions, 
as well as basis orthogonalizations, thus possessing limited concurrency. On the other hand, 
SU steps can be accomplished by block operations like A times X, thus more scalable. 

To reach for maximal concurrency, we choose the power iteration for subspace updating 
(and also include a Gauss-Newton method to test the versatility of our construction). It is well 
known that the convergence of the power method can be intolerably slow, preventing it from 
being used to drive general-purpose eigensolvers. Therefore, the key to success reduces to 
whether we could accelerate the power method sufficiently and reliably to an extent that it can 
compete in speed with Krylov subspace methods in general. In this work, such an acceleration 
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Table 7.5 

Comparison results on computing k largest eigenpairs 



EIGS 


FEAST 



MPM 




GN 


name 

maxres 

time SpMV 

maxres 

time RR 

maxres 

time SpMV RR/p/d 

maxres 

time SpMV RR/p/d 





tol= 

le-6 











Andrew. 

1.0e-7 

218 

3e+3 

1.0e-8 

254 

5 

2.6e-8 

58 

6e+4 

2/ 

1/5 

3.0e-8 

92 

6e+4 

2/ 1/5 

C60 

4.9e-8 

13 

2e+3 

7.9e-9 

59 

3 

l.le-9 

13 

5e+4 

2/ 

1/9 

5.2e-7 

9 

3e+4 

2/1/8 

cfdl 

2.5e-14 

338 

3e+3 

4.2e-8 

113 

4 

5.6e-9 

155 

6e+4 

2/ 

1/3 

1.5e-7 

143 

4e+4 

1/ 1/3 

financ. 

3.1e-14 

287 

3e+3 

6.1e-10 

41 

3 

1.6e-6 

37 

2e+4 

1/ 

1/3 

l.le-12 

67 

3e+4 

11 1/3 

GalOAs. 

4.2e-14 1439 

8e+3 

1.6e+0 

4704 

2 

5.7e-8 

264 

le+5 

2/ 

1/5 

9.2e-7 

380 

le+5 

2/1/5 

Ga3As3. 

1.9e-8 

353 

5e+3 

2.9e-l 

11738 

21 

6.4e-8 

101 

7e+4 

2/ 

1/5 

5.3e-7 

136 

6e+4 

2/1/5 

shallo. 

1.5e-10 

774 

8e+3 

5.2e-9 

69 

4 

4.9e-9 

207 

2e+5 

2/ 

1/7 

9.2e-8 

207 

le+5 

2/ 1/7 

SilOHl. 

5.6e-7 

10 

2e+3 

2.6e-10 

84 

3 

5.2e-9 

11 

4e+4 

2/ 

1/9 

1.2e-10 

11 

3e+4 

2/ 1/9 

Si5H12 

1.5e-12 

13 

2e+3 

1.2e-8 

170 

3 

1.0e-10 

10 

3e+4 

2/ 

1/6 

4.6e-8 

12 

3e+4 

2/1/6 

SiO 

1.4e-13 

58 

3e+3 

4. le-7 

265 

2 

1.4e-8 

23 

4e+4 

2/ 

1/5 

4. le-7 

29 

4e+4 

2/ 1/5 

wathen. 

5.5e-14 

39 

2e+3 

6.0e-8 

11 

4 

l.le-6 

10 

2e+4 

1/ 

1/3 

6.9e-ll 

26 

4e+4 

2/ 1/5 

Ge87H7. 

1.7e-8 1451 

8e+3 

5.3e-9 

8352 

3 

6.5e-10 

439 

2e+5 

2/ 

1/6 

1.2e-7 

392 

le+5 

2/1/6 

Ge99H 1. 

2.5e-14 1636 

8e+3 

5.6e-7 

6119 

2 

2.3e-9 

348 

le+5 

2/ 

1/6 

7.4e-8 

402 

le+5 

2/1/6 

Si41Ge. 

l.le-8 2909 

9e+3 

3.9e-7 

14929 

2 

1.6e-9 

863 

2e+5 

2/ 

1/7 

5.8e-8 

708 

le+5 

2/ 1/7 

Si87H7. 

3.5e-14 3568 

le+4 

2.8e-l 

1702 

1 

4.0e-9 1126 

3e+5 

2/ 

1/7 

1.le-7 

882 

le+5 

2/ 1/7 

Ga41As. 

7.4e-14 4100 

le+4 

8.6e-l 

1066 

1 

1.2e-10 1029 

2e+5 

3/ 

1/5 

2. le-7 1028 

le+5 

2/ 1/7 

name 

maxres 

time SpMV 

maxres 

time RR 

maxres 

time SpMV RR/p/d 

maxres 

time SpMV RR/p/d 





tol= 

e-12 











Andrew. 

5.6e-14 

232 

4e+3 

4.7e-14 

489 

9 

8.8e-13 

148 

le+5 

5/2/5 

1.0e-12 

125 

8e+4 

4/1/5 

C60 

6.3e-13 

15 

2e+3 

2.8e-13 

89 

5 

2.0e-12 

13 

5e+4 

3/ 1/9 

5.5e-12 

13 

4e+4 

3/ 1/9 

cfdl 

2.5e-14 

296 

3e+3 

7.1e-14 

204 

8 

9.8e-13 

190 

8e+4 

4/ 1/3 

9.9e-13 

188 

6e+4 

3/ 1/3 

financ. 

2.1e-14 

283 

3e+3 

2.1e-14 

67 

5 

1.3e-12 

97 

5e+4 

3/2/3 

l.le-12 

69 

3e+4 

1/ 1/3 

GalOAs. 

4.8e-14 1784 

8e+3 

1.6e+0 

4631 

2 

9.6e-13 

442 

2e+5 

4/ 1/5 

9.7e-13 

580 

2e+5 

4/1/6 

Ga3As3. 

2.1e-14 

419 

5e+3 

2.9e-l 

11245 

21 

1.7e-12 

169 

le+5 

4/ 1/5 

1.7e-12 

198 

le+5 

4/1/5 

shallo. 

4.6e-13 

768 

8e+3 

1.9e-13 

121 

7 

1.0e-12 

234 

2e+5 

4/ 1/7 

9.9e-13 

280 

2e+5 

4/1/7 

SilOHl. 

5.3e-14 

11 

2e+3 

4.0e-13 

104 

4 

6.2e-13 

10 

3e+4 

2/ 1/9 

3.7e-14 

12 

3e+4 

3/ 1/9 

Si5H12 

l.le-14 

15 

2e+3 

2.6e-13 

259 

5 

9.5e-13 

11 

3e+4 

2/ 1/6 

5.3e-12 

15 

3e+4 

3/ 1/6 

SiO 

1.4e-14 

58 

3e+3 

4.7e-13 

533 

4 

9.8e-13 

33 

5e+4 

3/ 1/5 

1.4e-12 

45 

6e+4 

4/1/5 

wathen. 

4.3e-14 

36 

2e+3 

5.1e-14 

24 

8 

l.le-12 

19 

4e+4 

2/ 1/5 

9.8e-13 

30 

4e+4 

3/ 1/5 

Ge87H7. 

2.8e-14 1524 

8e+3 

1.3e-13 13993 

5 

4.8e-12 

435 

2e+5 

3/ 1/6 

1.0e-12 

523 

2e+5 

4/1/6 

Ge99Hl. 

8.4e-14 1563 

8e+3 

2.1e-14 13438 

5 

3.7e-12 

395 

2e+5 

2/ 1/6 

9.6e-13 

569 

2e+5 

4/1/6 

Si41Ge. 

2.6e-14 2991 

9e+3 

2.5e-14 35270 

5 

9.9e-13 

865 

2e+5 

3/ 1/7 

l.le-12 

954 

2e+5 

3/ 1/7 

Si87H7. 

2.8e-14 3506 

le+4 

2.8e-l 

1924 

1 

1.0e-12 1018 

2e+5 

3/ 1/7 

1.4e-12 1102 

2e+5 

3/ 1/7 

Ga41As. 

7.5e-14 4103 

le+4 

8.6e-l 

1242 

1 

7.9e-13 1135 

2e+5 

3/ 1/7 

3.7e-12 1366 

2e+5 

3/ 1/7 


is accomplished mainly through the use of three techniques: (1) an augmented Rayleigh- 
Ritz (ARR) procedure that can provably accelerate convergence under mild conditions; (2) 
a set of easy-to-control, low-degree polynomial accelerators; and (3) a bold stoping rule for 
SU steps that essentially allows an iterate matrix to become numerically rank-deficient. Of 
course, the success of our construction also depends greatly on a set of carefully integrated 
algorithmic details. The resulting algorithm is named ARRABIT, which uses A only in matrix 
multiplications. 

Numerical experiments in Matlab on sixteen test matrices from the UF Sparse Matrix 
Collection show, convincingly in our view, that the accuracy and efficiency of ARRABIT is in¬ 
deed competitive to start-of-the-art eigensolvers. Exceeding our expectations, ARRABIT can 
already provide multi-fold speedups over the benchmark solver EIGS, without explicit code 
parallelization and without running on massively parallel machines, on difficult problems. In 
particular, it often only needs two or three, sometimes just one, ARR projections to reach a 
good solution accuracy. 

There are a number of future directions worth pursuing from this point on. For one thing, 
the robustness and efficiency of ARRABIT can be further enhanced by refining its construction 
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Table 7.6 

Comparison results on computing k smallest eigenpairs 




EIGS 


FEAST 



MPM 


GN 


name 

maxres 

time SpMV 

maxres 

time RR 

maxres 

time SpMV RR/p/d 

maxres 

time SpMV 

RR/p/d 





tol= 

e-6 









Andrew. 

4.9e-7 

399 

7e+3 

8.5e-8 

219 

4 

4.2e-7 

113 

le+5 2/1/8 

5.2e-9 

168 

le+5 

3/ 1/8 

C60 

2.2e-13 

8 

2e+3 

6.4e-5 

291 

16 

9.6e-7 

16 

5e+4 4/2/6 

2.4e-6 

9 

2e+4 

3/ 1/3 

cfdl 

4.7e-9 

3871 

6e+4 

4.2e-8 

167 

7 

3.4e-7 

601 

7e+5 2/ 1/ 15 

4.8e-6 

614 

6e+5 

5/ 2/ 15 

financ. 

1.2e-9 

1563 

2e+4 

4.5e-8 

51 

4 

1.7e-6 

338 

4e+5 2/1/ 15 

5.3e-9 

379 

3e+5 

3/ 1/ 15 

GalOAs. 

2.9e-12 

2740 

2e+4 

8.9e-9 

9302 

4 

6.2e-6 

751 

3e+5 2/1/8 

1.8e-6 

715 

2e+5 

2/ 1/7 

Ga3As3. 

1.7e-12 

599 

8e+3 

7.3e-8 

1837 

3 

6.9e-6 

325 

2e+5 2/1/9 

1.7e-9 

282 

2e+5 

3/ 1/9 

shallo. 

3.8e-14 

1614 

2e+4 

6.1e-8 

69 

4 

2.0e-8 

400 

4e+5 2/ 1/ 14 

4.1e-6 

261 

2e+5 

2/ 1/9 

SilOHl. 

1.5e-7 

14 

2e+3 

1.2e-7 

121 

4 

2.8e-7 

13 

5e+4 2/1/8 

7.1e-6 

12 

3e+4 

2/ 1/8 

Si5H12 

5.8e-12 

21 

3e+3 

1.5e-8 

166 

3 

3.3e-7 

14 

4e+4 2/1/8 

6.5e-6 

15 

3e+4 

2/ 1/8 

SiO 

2.7e-13 

97 

5e+3 

5.6e-8 

537 

4 

4. le-7 

46 

9e+4 2/1/8 

8.8e-10 

57 

9e+4 

3/ 1/8 

wathen. 

1.4e-9 

118 

8e+3 

8.4e-8 

10 

4 

8.2e-6 

61 

2e+5 2/ 1/ 15 

2.4e-7 

63 

le+5 

3/ 1/ 15 

GeS7H7. 

2.0e-13 

2559 

le+4 

2.7e-8 11268 

4 

4.8e-7 

509 

3e+5 2/1/9 

8.1e-10 

641 

2e+5 

3/ 1/9 

Ge99Hl. 

2. le-11 

2319 

le+4 

1.0e-8 11892 

4 

4.8e-7 

568 

3e+5 2/1/9 

2.0e-6 

564 

2e+5 

2/ 1/8 

Si41Ge. 

4.1e-9 

4650 

le+4 

1.2e-8 25658 

4 

6.3e-7 1102 

3e+5 2/1/11 

4.1e-10 1361 

3e+5 

3/1/11 

Si87H7. 

3.0e-13 

5458 

2e+4 

3.3e+0 

1842 

1 

3.2e-6 1201 

3e+5 2/1/11 

7.4e-6 1243 

2e+5 

2/ 1/ 10 

Ga41As. 

3.6e-7 32279 

8e+4 

8.6e-l 

1095 

1 

2.1e-8 3166 

5e+5 3/1/11 

1.3e-6 3193 

4e+5 

3/2/11 

name 

maxres 

time SpMV 

maxres 

time RR 

maxres 

time SpMV RR/p/d 

maxres 

time SpMV 

RR/p/d 

tol=le-12 

Andrew. 

1.2e-13 

422 

7e+3 

4.1e-13 

361 

7 

9.7e-13 

200 

2e+5 4/1/8 

l.le-12 

185 

2e+5 

5/ 1/8 

C60 

2.6e-14 

9 

2e+3 

6.4e-6 

358 

21 

2.8e-12 

26 

7e+4 9/3/6 

9.2e-13 

15 

4e+4 

8/2/6 

cfdl 

2.9e-14 

4209 

6e+4 

5.5e-14 

383 

16 

8.9e-12 

719 

9e+5 4/ 1/ 15 

4.2e-12 1017 

le+6 12/ 3/ 15 

financ. 

9.7e-13 

1776 

2e+4 

5.5e-14 

93 

8 

1.4e-12 

600 

7e+5 6/ 1/ 15 

3.4e-12 

467 

4e+5 

5/ 1/ 15 

GalOAs. 

2.8e-12 

3479 

2e+4 

9.4e-14 17251 

7 

l.le-12 1558 

7e+5 6/1/8 

4.0e-12 

963 

3e+5 

3/ 1/9 

Ga3As3. 

1.2e-12 

571 

8e+3 

3.8e-13 

2908 

5 

9.9e-13 

420 

3e+5 4/1/9 

9.5e-13 

338 

2e+5 

5/ 1/9 

shallo. 

3.9e-14 

1532 

2e+4 

2.7e-13 

126 

8 

3.2e-12 

600 

6e+5 5/ 1/ 12 

4.0e-13 

505 

4e+5 

5/ 1/ 14 

SilOHl. 

7.9e-14 

18 

2e+3 

2.1e-12 

198 

7 

2.0e-12 

16 

5e+4 4/1/8 

3.9e-13 

20 

5e+4 

5/ 1/8 

Si5H12 

1.5e-13 

22 

3e+3 

3.6e-14 

228 

5 

2.1e-12 

20 

6e+4 4/ 1/ 8 

9.6e-12 

23 

6e+4 

4/ 1/8 

SiO 

2.7e-13 

93 

5e+3 

2.7e-13 

915 

7 

6.0e-13 

64 

le+5 5/1/8 

9.4e-13 

68 

le+5 

5/ 1/8 

wathen. 

8.2e-13 

146 

8e+3 

1.0e-13 

18 

7 

3.1e-12 

163 

5e+5 6/2/15 

1.5e-12 

120 

3e+5 

7/ 2/ 15 

Ge87H7. 

1.8e-13 

2250 

le+4 

1.5e-13 18852 

7 

2.6e-13 

892 

4e+5 5/1/9 

9.9e-13 

765 

3e+5 

5/ 1/9 

Ge99Hl. 

1.8e-13 

2353 

le+4 

6.7e-14 17683 

7 

9.7e-13 

986 

5e+5 4/ 1/ 9 

9.9e-13 

804 

3e+5 

4/ 1/9 

Si41Ge. 

3.3e-13 

4656 

2e+4 

1.3e-13 46386 

7 

9.9e-12 1705 

5e+5 4/1/11 

9.8e-13 1568 

3e+5 

5/1/11 

Si87H7. 

3.0e-13 

5487 

2e+4 

3.3e+0 

1854 

1 

l.le-12 2284 

6e+5 6/1/11 

l.le-12 1960 

4e+5 

5/1/11 

Ga41As. 

5.3e-12 33254 

8e+4 

8.6e-l 

998 

1 

8.8e-13 5700 

le+6 7/2/11 

1.7e-12 3913 

5e+5 

5/ 2/ 12 


and and tuning its parameters. Software development and an evaluation of its parallel scala¬ 
bility are certainly important. The prospective of extending the algorithm to non-Hermitian 
matrices and the generalized eigenvalue problem looks promising. Overall, we feel that the 
present work has laid a solid foundation for these and other future activities. 
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